Developing an Amazon Movie Scraper

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #16
as i never touched the filmaffinity scraper, all of this might not be 100% correct. however, since you want to use it as a reference...

\1 is NOT the contents of variable one, it is the first selection (capture group) in the regexp - big diff. $$1 is the contents of variable 1. remember that we evaluate expressions in a LIFO order, meaning we evaluate the innermost expressions first.

"the second expression" (which will be the first one that gets evaluated) grabs the film id and sticks it in buffer 7. however, this one will only match IF we were redirected to a perfect match, i.e. we are on a film info page. the third expression will fill buffer 5 with this match IF we have one (the $$7 is used to indicate the contents of buffer 7, just like you do input="$$1" - the contents of buffer 1).

as i have always said, the wiki is NOT an authoritative documentation source. the <id> tag is an optional tag that entities may or may not fill. it is really handy on sites that use ids to form urls etc.

the fourth expression is irrelevant, it does not help anything.

the fifth expression is the real meat if we are on a search page.

when that is done we return to the first expression to evaluate that. it takes the contents of buffer 5, selects it all and sticks a <results> tag around our matches. empty <expression> tags mean select everything in the input.
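
The evaluation order described above can be imitated with a toy model. This is only a sketch in Python, not the scraper engine itself: the `buffers` dict, `expand_buffers`, `run_regexp` and the sample HTML are all hypothetical stand-ins, and leaving a buffer untouched when nothing matches is an assumption. It shows the difference between `\1` (a selection from the current match) and `$$n` (the contents of buffer n), with the innermost expression run first.

```python
import re

# Hypothetical stand-in for the scraper's numbered buffers ($$1 .. $$9).
buffers = {n: "" for n in range(1, 10)}

def expand_buffers(text):
    """Replace every $$n with the current contents of buffer n."""
    return re.sub(r"\$\$(\d)", lambda m: buffers[int(m.group(1))], text)

def run_regexp(input_spec, expression, output, dest, repeat=False):
    """Apply one expression and store the formatted output in buffer `dest`."""
    source = expand_buffers(input_spec)
    # An empty expression is treated as "select the whole input".
    pattern = re.compile(expression or r"(.*)", re.DOTALL)
    matches = list(pattern.finditer(source))
    if not repeat:
        matches = matches[:1]
    result = ""
    for match in matches:
        filled = expand_buffers(output)  # $$n -> contents of buffer n
        # \n -> the n-th selection (capture group) of this match
        filled = re.sub(r"\\(\d)", lambda g: match.group(int(g.group(1))) or "", filled)
        result += filled
    if matches:  # assumption: no match leaves the buffer unchanged
        buffers[dest] = result

# Made-up page content, placed in buffer 1 as the input.
buffers[1] = '<a href="/film123456.html">Soylent Green</a>'

# Innermost expression first: capture the film id into buffer 7.
run_regexp("$$1", r"/film(\d+)\.html", r"\1", dest=7)
# Next: build an entity in buffer 5, mixing \1 (this match) with $$7 (buffer 7).
run_regexp("$$1", r">([^<]*)</a>",
           r"<entity><title>\1</title><id>$$7</id></entity>", dest=5)
# Outermost expression last: empty expression selects all of buffer 5.
run_regexp("$$5", "", r"<results>\1</results>", dest=3)

print(buffers[3])
```

The last call is the "first expression" in the file: it only wraps whatever the inner expressions left in buffer 5.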
jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #17
Thank you, I feel this reply has helped a lot. I can now see how part of the fifth expression finds the results and then outputs the title, url, etc. In the case of Amazon this needs a much longer and more complicated expression, as its results page includes each result in several slightly different forms and we only want to list each once. Here is just one raw result section, which is in my opinion the most useful form.

Code:
<a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/ref=sr_1_3?ie=UTF8&s=dvd&qid=1219010497&sr=1-3"><span class="srTitle">Soylent Green</span></a>
  
   ~ John Barclay, Whit Bissell, Jan Bradley,  and Chuck Connors <span class="bindingBlock">(<span class="binding">DVD</span> - 2007)</span></td></tr>

In this example B000VAHR0U is the ID number, which is used to form urls to access both artwork and individual DVD info, and the title is Soylent Green. On the results page Amazon often does not list the real year of the film, listing instead the year that edition of the DVD was issued, which makes the year useless for searching purposes. Would it be ok to simply ignore the year in the fifth expression and not include it in the built title? That is, just returning "Soylent Green" rather than the useless "Soylent Green (2007)".

My first guess at an expression to search with for section 5 would be

Code:
<expression repeat="yes" noclean="1,2">www.amazon.com/[a-zA-Z0-9]*/dp/([B0-9]*)/ref=sr_[0-9]_[0-9]\?ie=UTF8&amp;s=dvd&amp;qid=[0-9]*&amp;sr=[0-9]-[0-9]&quot;&gt;&lt;span class=&quot;srTitle&quot;&gt;([a-zA-Z0-9]*)&lt;/span&gt;&lt;/a&gt;</expression>

Does this look correct to you? In particular the escaping of various characters? What is the correct way to escape a ? or a space (or is this not necessary)?

Note: It is possible for film titles to begin with a number or a letter, for example "2001 A Space Odyssey".
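
On the escaping question, here is a small sketch using Python's `re` module (an assumption: the scraper's own regex engine is not Python, but these basics behave the same way in the common engines). A `?` has to be written `\?` to match literally, while a space is an ordinary character. The `html.unescape` call only illustrates what a pattern looks like once XML entities such as `&amp;` and `&quot;` have been decoded, assuming the scraper file is parsed as ordinary XML.

```python
import html
import re

url = "http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/ref=sr_1_3?ie=UTF8&s=dvd"

# Unescaped "?" makes the preceding "3" optional, so the literal "?" is never matched.
unescaped = re.search(r"ref=sr_1_3?ie=UTF8", url)
# "\?" matches a literal question mark.
escaped = re.search(r"ref=sr_1_3\?ie=UTF8", url)
# A space needs no escaping.
space = re.search(r"Soylent Green", '<span class="srTitle">Soylent Green</span>')
# The pattern text as the regex engine would see it after XML entity decoding.
decoded = html.unescape(r"\?ie=UTF8&amp;s=dvd&quot;&gt;&lt;span")

print(unescaped is None, escaped is not None, space is not None)
print(decoded)
```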
Gaarv Offline
Junior Member
Posts: 35
Joined: May 2008
Reputation: 0
Post: #18
If executed, the regexp you gave won't work for the title, because the character class for the title doesn't include the space.

Maybe you did already, but I highly suggest you read the url pointed in the wiki : http://www.regular-expressions.info/repeat.html

It's also specified that laziness won't work, but it does to an extent, i.e.:

Code:
[0-9A-Za-z .!#".. whatever]*
can be simply replaced by
Code:
[^<>]*
meaning any run of characters except "<" and ">". Pretty handy as long as you have a matching pattern to end the repetition.

Below, an application with the string you provided:

Code:
www.amazon.com/[^&lt;&gt;/]*/dp/([0-9A-Z]*) etc

I'm not sure why you would want to obtain title and year in one regex; if you make several, there's less chance of mistakes.
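
To illustrate the negated character class on the result line quoted earlier in the thread, here is a sketch in Python (the pattern is one possible way to finish the truncated example, not the actual scraper expression). `[^<>/]*` skips the title slug in the url, `[^<>]*` skips the rest of the link, and the two groups pick out the ID and the title.

```python
import re

sample = ('<a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/'
          'ref=sr_1_3?ie=UTF8&s=dvd&qid=1219010497&sr=1-3">'
          '<span class="srTitle">Soylent Green</span></a>')

# Group 1: the ID after /dp/. Group 2: the title inside the srTitle span.
pattern = re.compile(
    r'www\.amazon\.com/[^<>/]*/dp/([0-9A-Z]*)/[^<>]*>'
    r'<span class="srTitle">([^<>]*)</span>'
)

results = pattern.findall(sample)
print(results)
```

Because `[^<>]*` accepts spaces, digits and punctuation alike, titles such as "2001 A Space Odyssey" need no special handling.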


spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #19
laziness now works, we have switched parser to pcre so you can be as lazy as you want :)
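
For anyone unsure what laziness buys you, a quick comparison, sketched with Python's `re` module rather than pcre (for this case the two behave the same). The input string is made up for the example.

```python
import re

html_text = ('<span class="srTitle">Soylent Green</span>'
             '<span class="srTitle">2001 A Space Odyssey</span>')

# Greedy: .* runs on to the LAST </span>, swallowing both titles in one match.
greedy = re.findall(r'srTitle">(.*)</span>', html_text)
# Lazy: .*? stops at the FIRST </span>, giving one match per title.
lazy = re.findall(r'srTitle">(.*?)</span>', html_text)

print(len(greedy), lazy)
```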
C-Quel Offline
Retired Team-Kodi Member
Posts: 1,375
Joined: Aug 2004
Reputation: 0
Post: #20
try this....

http://pastebin.com/m657636a8

old but with cleanup + minor changes should work

jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #21
Many, many, many thanks for pointing me towards this. I felt I was getting closer to getting the GetSearchResults working but had not yet succeeded. Yours does work.

For your information I found the following results from your unmodified Amazon scraper.

1. It successfully did a CreateSearchUrl
2. It only listed one result using its GetSearchResults
3. When that result was used it only succeeded in filling in "Studio", "Runtime" and obtaining the movie thumbnail.

I have so far 'improved' it by

1. Listing the entire page of results returned by Amazon (a maximum of twelve results). This was done by adding a repeat command to your GetSearchResults.
2. Filling in the movie "Title", "Year" (the proper film year not the DVD year), and the "Plot".
3. I have very slightly changed the filename you used for the movie thumbnail to one I believe will still return a result in a very few cases yours might not. My modified filename will always return the largest available artwork (usually 500 pixels) whereas yours would only get 500 pixel tall artwork and I believe a very few DVDs may not have artwork available that big.

While I have added/changed the code to also do "Directors" and "Actors", this is not working. My current (non-working) approach is to have a first regex that gets the block listing all the actor(s) or director(s), and then a second regex that is supposed to extract the individual names from that block.

My efforts so far can be obtained from the following link

http://homepage.mac.com/jelockwood/.Publ...ustest.zip

As far as I can see there is no available information on the Amazon product page to do MPAA rating (Amazon use a GIF and no text), nor a tagline or summary, genre, or writer. There might be a way of getting a rating (that is reader score) by using the following text align="absbottom" alt="4.5 out of 5 stars" height="12". Note: There are several entries of this text in a product page and we would always want to look only at the first.
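
A sketch of the "first occurrence only" idea for the reader score, in Python. The surrounding `<img ...>` tags and the second rating are made up for the example; only the attribute text comes from the page. `re.search` stops at the first match, which is the behaviour wanted here.

```python
import re

# Hypothetical page fragment with two ratings; only the first should be used.
page = ('<img align="absbottom" alt="4.5 out of 5 stars" height="12">'
        ' ... '
        '<img align="absbottom" alt="3.0 out of 5 stars" height="12">')

match = re.search(r'alt="([0-9.]+) out of 5 stars"', page)
rating = match.group(1) if match else None

print(rating)
```

In scraper terms this corresponds to an expression without repeat, so later occurrences are ignored.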

If anyone else would like to help out it would be much appreciated. In particular getting the actor(s) and director(s) working is a priority.

For everyone's benefit this is what the block containing all the actors looks like

Code:
<li> <b>Actors:</b> <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Charlton%20Heston">Charlton Heston</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Edward%20G.%20Robinson">Edward G. Robinson</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Dick%20Van%20Patten">Dick Van Patten</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Chuck%20Connors">Chuck Connors</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Joseph%20Cotten">Joseph Cotten</a></li>

And the Directors block is virtually identical

Code:
<li> <b>Directors:</b> <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Richard%20Fleischer">Richard Fleischer</a></li>
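
The two-step approach described above (first isolate the block, then pull the names out of it) can be sketched like this in Python. It is only an illustration of the idea, not the scraper XML, and the sample is shortened to three actors.

```python
import re

page = ('<li> <b>Actors:</b> '
        '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Charlton%20Heston">Charlton Heston</a>, '
        '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Edward%20G.%20Robinson">Edward G. Robinson</a>, '
        '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Chuck%20Connors">Chuck Connors</a></li>'
        '<li> <b>Directors:</b> '
        '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Richard%20Fleischer">Richard Fleischer</a></li>')

def names_in_block(label, html_text):
    """Step 1: grab the <li> block for `label`. Step 2: extract each link text."""
    block = re.search(r'<b>' + re.escape(label) + r':</b>(.*?)</li>', html_text, re.DOTALL)
    if block is None:
        return []
    return re.findall(r'<a href="[^"]*">([^<]*)</a>', block.group(1))

actors = names_in_block("Actors", page)
directors = names_in_block("Directors", page)
print(actors, directors)
```

The lazy `(.*?)</li>` in the first step keeps the Actors block from running on into the Directors block.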
jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #22
I made good progress today and have now managed to also get "Directors", "Actors", "Rating", and "Votes" all working!

I have even partially managed to get MPAA (aka Certification) working. I say partially because the only text to search and match for in the Amazon HTML code is in lowercase, and I would prefer to return and display it in block capitals as is more usual. Does anyone have any suggestions on how to use regex to convert to uppercase? Currently it returns "pg" rather than the desired "PG".

With this done the scraper will be pretty much finished in that I will have done as many fields as Amazon provide. I will then do more extensive testing (with more titles), and then make a second version for Amazon UK.
jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #23
(There does not seem to be a way to edit ones own posts.)

I have now come up with a way of returning MPAA ratings in uppercase. It is rather brute force: I wrote a copy of the regex for each MPAA rating, so there are multiple copies for MPAA but only one will ever match. Then, rather than using the usual \1, I hard-coded the text to return, e.g. PG-13.
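
The brute-force idea, one pattern per rating with a hard-coded output, looks roughly like this when sketched in Python. The `rated pg-13` style fragments are hypothetical, as the exact text on the Amazon page is not shown here, and `RATING_RULES` and `certification` are made-up names.

```python
import re

# (pattern, hard-coded output). Longer ratings come first so that
# "pg-13" is not claimed by the plain "pg" rule.
RATING_RULES = [
    (r"\bnc-17\b", "NC-17"),
    (r"\bpg-13\b", "PG-13"),
    (r"\bpg\b", "PG"),
    (r"\bg\b", "G"),
    (r"\br\b", "R"),
]

def certification(fragment):
    """Return the hard-coded uppercase label for the first rule that matches."""
    for pattern, label in RATING_RULES:
        if re.search(pattern, fragment):
            return label
    return None

print(certification("rated pg-13"), certification("rated pg"), certification("rated r"))
```

Adding UK certificates would just mean more rows in the same table.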

This now looks as complete as is possible with the information Amazon provide.


Hmm, just had a thought: even when one searches Amazon US it can return UK results (and vice versa), though it will generally list the US results first as a better match. However, if one does search Amazon US and happens to select a UK DVD, then one would presumably get a UK certification rather than an MPAA rating. I will be able to 'fix' this by adding yet more hard-coded copies for both types of certification. This will be useful anyway since I want to do a UK version as well. I will probably leave it as an exercise for the reader to do Germany, France, Australia, etc.
C-Quel Offline
Retired Team-Kodi Member
Posts: 1,375
Joined: Aug 2004
Reputation: 0
Post: #24
Keep up the good work, nice to see more hands on deck :)

jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #25
The Amazon US scraper looks done now, and I have converted it to do Amazon UK. This was not as straightforward as it sounds, as there were more differences than you might think, even such minor ones as two spaces where the US one had a single space.

Despite this I have managed to get all the same fields working as in the US one, with one exception: the "Plot" field. Unfortunately Amazon UK makes this more difficult in two ways. Firstly, the most appropriate block of text is much longer than would be preferred. Secondly, it has HTML formatting mixed in with the text. As far as I can see the scraper would clean up and remove that HTML formatting, but I am struggling to write a regex to extract the text, since the embedded HTML makes matching more difficult.

Here is what the raw section looks like

Code:
<b>Amazon.co.uk Review</b><br />
  While <I>Soylent Green</I> may be one of the many <a href="/exec/obidos/tg/feature/-/298867/${0}">dystopian visions</a> of the future, the film stands out because it's one of the few titles that addresses current environmental issues head on. Adapted from Harry Harrison's novel <I>Make Room, Make Room</I>, it gives us a nightmarish vision of an over-populated, polluted future on the brink of collapse--a vision that gets uncomfortably closer every year. Charlton Heston as police officer Thorn investigates a murder in between suppressing food riots and uncovers the nightmarish truth about Soylent Green, the new foodstuff being sold to the poor. <p> The film neatly combines police procedural with conspiracy thriller. Heston's scenes are counterpointed by more elegiac ones in which the centenarian Edward G Robinson as his friend Sol broods on the world he has outlived--his death in a euthanasia chamber is a gloriously lachrymose moment, which he plays to the hilt. Heston, too, is good as Thorn, a morally equivocal cop who loots the apartments of the victims whose deaths he investigates--he's a man just getting by in an impossible world. <p> <B>On the DVD:</B> <I>Soylent Green</I> on disc comes with a commentary from director Richard Fleischer, the highpoint of which is a memorable description of what it was like to work with the brilliant ailing, entirely deaf Robinson. He is joined by Leigh Taylor-Young whose work on the film as heroine led to years of serious environmentalist commitment. It has a useful contemporary making-of documentary and touching shots of Robinson's 100th birthday party with telegrams from Sinatra and others. The feature itself is presented in anamorphic widescreen with its original mono sound. --<I>Roz Kaveney</I>
  
<br /><br />

(It will be easier to view if you copy and paste it into an editor.)

All I need is a regex for the above and then both Amazon scrapers can be released.

Note: The URL of the original Amazon UK page that the above came from is

http://www.amazon.co.uk/Soylent-Green-Ch...117&sr=1-1
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #26
This works for me:

Code:
Review</b><br />(.*)--<I
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #27
I tested with some titles at amazon.co.uk and found that they sometimes use --<i> and sometimes --<I> at the end of the review.

Try this regex and clean it afterwards:
Code:
Review</b><br />(.*?)--<
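
A sketch of "match, then clean", using Python's `re` module as a stand-in for the scraper's engine. The review text is shortened from the sample posted earlier, and the tag stripping is only a rough imitation of whatever cleaning the scraper does. In Python the `re.DOTALL` flag is needed so that `.` can cross the line break after `<br />`.

```python
import re

page = ('<b>Amazon.co.uk Review</b><br />\n'
        '  While <I>Soylent Green</I> may be one of the many '
        '<a href="/exec/obidos/tg/feature/-/298867/${0}">dystopian visions</a> of the future. '
        '<p> The film neatly combines police procedural with conspiracy thriller. '
        '--<i>Roz Kaveney</i>\n'
        '  \n<br /><br />')

# Lazy match up to the first "--<", which covers both --<i> and --<I>.
match = re.search(r'Review</b><br />(.*?)--<', page, re.DOTALL)
raw = match.group(1) if match else ""

# Clean afterwards: drop tags, then collapse runs of whitespace.
plain = re.sub(r'<[^>]+>', '', raw)
plain = ' '.join(plain.split())

print(plain)
```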
jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #28
Sorry for the delay in replying, I have been busy doing other things. Your code above does indeed seem to work. However, before I came back and checked, I had come up with this, which also seems to work

Code:
<b>Amazon.co.uk Review</b><br />\n  ([^\n]*)

I could strip off the author bit the same way as you, like so

Code:
<b>Amazon.co.uk Review</b><br />\n  ([^\n]*)--<

However, I would not be surprised if there are some entries with no author listed, and then it would fail to match.

Let me know if you see any drawbacks from mine.
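
One possible way around the missing-author worry is to make the `--<` part optional. This is an untested idea sketched in Python, not something checked against real Amazon UK pages; both sample strings are made up, and `PLOT` is a hypothetical name.

```python
import re

# The review text is taken lazily; an optional " --<author...>" tail is allowed
# before the end of the line, so the match succeeds with or without an author.
PLOT = re.compile(
    r'<b>Amazon\.co\.uk Review</b><br />\n  ([^\n]*?)(?:[ \t]*--<[^\n]*)?\n'
)

with_author = ('<b>Amazon.co.uk Review</b><br />\n'
               '  A nightmarish vision of the future. --<I>Roz Kaveney</I>\n'
               '  \n<br /><br />')
without_author = ('<b>Amazon.co.uk Review</b><br />\n'
                  '  A nightmarish vision of the future.\n'
                  '  \n<br /><br />')

plot_a = PLOT.search(with_author).group(1)
plot_b = PLOT.search(without_author).group(1)
print(plot_a == plot_b)
```

A plain "--" inside the review text (as in "collapse--a vision") does not end the match, because the tail only starts at "--" followed by "<".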
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #29
Do you have an example url for that?
jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Amazon Scrapers now available to download
Post: #30
I finally got round to setting up a download page for the Amazon video Scrapers I have written (as discussed in this thread).

There is one for Amazon.com and a matching one for Amazon.co.uk. Just to make it clear, these are video scrapers, not music or TV scrapers. You can download them as a Zip file from the following link; the Zip file includes the matching scraper logos as well.

These have been tested on a Mac but should work on all XBMC platforms.

Note: I still recommend people normally should use the standard IMDB scraper first, but if like me, you have some DVDs which IMDB does not list (e.g. Simpsons Christmas Special) then these scrapers will help.

Here is the download link http://homepage.mac.com/jelockwood/scrapers.html