Developing an Amazon Movie Scraper

ShortySco Offline
Fan
Posts: 329
Joined: Apr 2007
Reputation: 0
Post: #11
jelockwood Wrote:Unfortunately for me I am using Mac OS X and it seems the range of Regexp tools is not as good.

Might be of help......

I have used some tools in the past for checking my regular expressions. Many of the ones I found were internet/Flash based (and, I assume, platform independent). Sorry, I can't remember the names/places, but Google will.

Shorty
spiff Offline
Grumpy Bastard Developer
Posts: 12,234
Joined: Nov 2003
Reputation: 82
Post: #12
remember: the scraper is an xml file, so any special chars need to be xml'ized for the regexp to function properly in the scraper (or rather, for the scraper xml to load correctly in the first place). this is why there are all these &lt; &amp; etc. in the other scrapers (which i assume you use for reference)
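
A minimal sketch of what "xml'izing" an expression means, using Python's standard library rather than anything from XBMC. The sample pattern and the two helper names are made up for illustration; the only point is that the regex engine sees the unescaped pattern, while the XML file must contain the escaped one.

```python
from xml.sax.saxutils import escape, unescape

# A plain regular expression, as you would test it in a regex tool.
plain = r'<span class="srTitle">([^<]*)</span>'


def to_element_text(pattern):
    """Escape &, < and > for use inside <expression>...</expression>."""
    return escape(pattern)


def to_attribute_value(pattern):
    """Also escape double quotes, for use inside output="..." attributes."""
    return escape(pattern, {'"': "&quot;"})


print(to_element_text(plain))
print(to_attribute_value(plain))

# Reversing the escaping gives back the pattern the regex engine will see.
assert unescape(to_attribute_value(plain), {"&quot;": '"'}) == plain
```

Characters such as ?, ~ and the space are not special in XML, so they pass through this step unchanged.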

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #13
spiff Wrote:remember: the scraper is an xml file, so any special chars need to be xml'ized for the regexp to function properly in the scraper (or rather, for the scraper xml to load correctly in the first place). this is why there are all these &lt; &amp; etc. in the other scrapers (which i assume you use for reference)

Yep, I spotted that on the Wiki and have taken it into consideration when trying RegExhibit. What makes it difficult to know whether my regex will be right is that in some places [a-zA-Z0-9] works and in others I have had to use .* instead. Escaping characters such as ~ (tilde), a question mark ?, a space character, and ( ) (literal parentheses, not regex grouping) is also confusing. These are not covered in the Wiki.

While I am at it, here are some questions I feel the Wiki does not adequately answer.

1. I am using folder names as the search criteria. The folder names include the year of the film, e.g. "Soylent Green (1973)". While IMDB works very well with all of that as a search string, Amazon does not like the year being included (with or without parentheses) when you type it into a web browser, which suggests it would equally not like it being sent by a scraper.

Does your typical scraper, when using folder names like this, include the year when creating a search URL, or does it strip it off? If so, how?

2. When a scraper looks at the search results, XBMC displays a list of titles found, but the scraper has to use the ID number to generate the URL to access the selected result. I am not clear from the Wiki where these two different steps are done, and how the results are linked. As you saw from my last post, I have found the relevant HTML code returned by Amazon and more or less have regex code that can extract either the ID or the title.

(Oops, I thought I had already posted this reply, but came back later and discovered this message still open for editing in a tab in my web-browser.)
spiff Offline
Grumpy Bastard Developer
Posts: 12,234
Joined: Nov 2003
Reputation: 82
Post: #14
1) i'm not completely sure, but afaik it's done by the scraper. if not, i will remedy that.
2) no, you don't have to use the id. you are free to list whatever url you want in the returned results from GetSearchResults. the flow is simply:

we call CreateSearchUrl with buffer 1 set to the cleaned title (url encoded) and, iirc, buffer 2 set to the year - if this is not so, i will fix it. xbmc grabs the url, then calls GetSearchResults with buffer 1 set to the returned html. this function returns an xml list with the obtained results. remember, that's ALL the scraper does - it translates some general html into a fixed xml format.
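
The flow described above can be sketched roughly as follows. This is plain Python written for illustration, not how XBMC implements it: the function names only echo the scraper sections, www.example.com and its HTML layout are invented, and whether the year is split off by XBMC or by the scraper is exactly the point left open in 1).

```python
import re
from urllib.parse import quote_plus
from xml.sax.saxutils import escape


def split_title_and_year(folder_name):
    """Split 'Soylent Green (1973)' into ('Soylent Green', '1973')."""
    match = re.match(r"^(.*?)\s*\((\d{4})\)\s*$", folder_name)
    if match:
        return match.group(1), match.group(2)
    return folder_name.strip(), ""


def create_search_url(buffer1, buffer2=""):
    """Stand-in for CreateSearchUrl: buffer1 is the URL-encoded title.

    buffer2 (the year) is ignored here, which is what a site that
    dislikes years in its search string would need.
    """
    return "http://www.example.com/search?q=" + buffer1


def get_search_results(html):
    """Stand-in for GetSearchResults: translate HTML into the results XML."""
    entities = []
    for film_id, title in re.findall(r'<a href="/film/(\d+)">([^<]*)</a>', html):
        entities.append(
            "<entity><title>%s</title>"
            "<url>http://www.example.com/film/%s</url></entity>"
            % (escape(title), film_id)
        )
    return "<results>" + "".join(entities) + "</results>"


title, year = split_title_and_year("Soylent Green (1973)")
url = create_search_url(quote_plus(title), year)
html = '<a href="/film/42">Soylent Green</a>'  # pretend this was fetched from url
print(url)
print(get_search_results(html))
```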

jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #15
The only way to test a scraper is for it to work almost completely. A bit of a catch-22.

Despite the very poor documentation, it seems clear enough there are three main sections to a scraper.

1. Create the Search URL
2. Process the results to list the returned movies and let a user choose one
3. For the chosen movie get the meta data fields

I am reasonably clear on CreateSearchUrl. However, the documentation for GetSearchResults sucks big time, especially as I now believe it contains at least one major error. So, having sat down and looked at two existing working scrapers (IMDB and FilmAffinity) and what documentation there is in the Wiki, I am going to break down the steps as I understand them so far and ask for confirmation and clarification.

I will use the FilmAffinity GetSearchResults code as the example here. So firstly here is the exact code in that scraper.

Code:
<GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="\1" dest="7">
        <expression>&lt;img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg"&gt;</expression>
      </RegExp>
      <RegExp dest="5" input="$$1" output="&lt;entity&gt;&lt;title&gt;\1 (\2)&lt;/title&gt;&lt;url&gt;http://www.filmaffinity.com/en/film$$7.html&lt;/url&gt;&lt;id&gt;$$7&lt;/id&gt;&lt;/entity&gt;">
        <expression noclean="1">&lt;title&gt;([^&lt;]*)\(([0-9]*)\) - FilmAffinity</expression>
      </RegExp>
      <RegExp input="$$1" output="\1" dest="4">
        <expression noclean="1">(&lt;b&gt;&lt;a href="/en/film.*)</expression>
      </RegExp>
      <RegExp dest="5+" input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3)&lt;/title&gt;&lt;url&gt;http://www.filmaffinity.com/en/film\1.html&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;">
        <expression repeat="yes" noclean="1,2">&lt;a href="/en/film([0-9]*).html[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt;[^\(]*\(([0-9]*)</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>

It seems fairly clear that the first RegExp line outputs the following

Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>
\1
</results>

where \1 is the content of variable 1

It is not clear what the second RegExp is doing, other than that it seems to be related to the unique ID number of the film on the website. Here is what the second one seems to translate to.

Code:
<img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg">

Interestingly, while this URL is valid and loads the film thumbnail, it does not seem to exist in the results returned by FilmAffinity.

The third RegExp is returning

Code:
<entity>
   <title>\1 (\2)</title>
   <url>http://www.filmaffinity.com/en/film$$7.html</url>
   <id>$$7</id>
</entity>

Here we see a case that seems to clearly point to an error in the Wiki. The Wiki suggests

Code:
<entity>
   <title>?</title>
   <url>?</url>
   <url>?</url>
</entity>

and makes no mention of <id>?</id> at all. Note: both the IMDB and FilmAffinity scrapers do use the ID tags.

I would guess that <title>?</title> is the title of the film as returned by the website, <url>?</url> is the URL to access details for that selected film, and <id>?</id> is a unique ID number of that film as used by the website.

I have no idea what purpose the fourth RegExp has. It appears in this case to return

Code:
(<b><a href="/en/film.*)

The fifth RegExp seems almost exactly the same as the third one.

Code:
<entity>
   <title>\2 (\3)</title>
   <url>http://www.filmaffinity.com/en/film\1.html</url>
   <id>\1</id>
</entity>

So while it seems clear that the 3rd and 5th RegExp fill in the returned XML, it is still not clear to me what bit does the actual searching of the results to find the list of films returned by the website.

I would particularly like an explanation of what the second and fourth RegExp bits do.
spiff Offline
Grumpy Bastard Developer
Posts: 12,234
Joined: Nov 2003
Reputation: 82
Post: #16
as i have never touched the filmaffinity scraper, all of this might not be 100% correct. however, since you want to use it as a reference...

\1 is NOT the contents of variable one, it is the first selection in the regexp - big diff. $$1 is the contents of variable 1. remember that we evaluate expressions in a LIFO order, meaning we evaluate the innermost expressions first.

"the second expression" (which will be the first one that gets evaluated) grabs the film id and sticks it in buffer 7. however, this one will only match IF we were redirected to a perfect match, i.e. we are on a film info page. the third expression will fill buffer 5 with this match IF we have one (the $$7 is used to indicate the contents of buffer 7, just like input="$$1" means the contents of buffer 1).

as i have always said, the wiki is NOT an authoritative documentation source. the <id> tag is an optional tag that entities may or may not fill. it is really handy on sites that use ids to form urls etc.

the fourth expression is irrelevant, it does not help anything.

the fifth expression is the real meat if we are on a search page.

when that is done we return to the first expression to evaluate that. it takes the contents of buffer 5, selects it all and sticks a <results> tag around our matches. empty <expression> tags mean select anything in input
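
To make the \1 versus $$1 distinction and the inner-first evaluation order concrete, here is a rough emulation in Python. It is a sketch under assumptions, not XBMC's code: apply_regexp is a made-up helper, the buffers are a plain dict, the sample HTML is invented, and leaving the destination untouched when nothing matches is an assumption.

```python
import re


def apply_regexp(buffers, source, expression, output, dest,
                 repeat=False, append=False):
    """Rough emulation of one <RegExp> element (hypothetical helper)."""
    text = buffers.get(source, "")
    # An empty <expression> selects the whole input as \1.
    pattern = re.compile(expression if expression else r"(.*)", re.DOTALL)
    matches = list(pattern.finditer(text))
    if not repeat:
        matches = matches[:1]
    result = ""
    for match in matches:
        def fill(m, match=match):
            if m.group(1) is not None:
                # $$N: contents of buffer N
                return buffers.get(int(m.group(1)), "")
            # \N: N-th group captured by this expression
            return match.group(int(m.group(2))) or ""
        result += re.sub(r"\$\$(\d+)|\\(\d)", fill, output)
    if matches:
        buffers[dest] = (buffers.get(dest, "") + result) if append else result


# Buffer 1 holds the downloaded page; here, a made-up search results page.
buffers = {1: '<a href="/en/film123.html">Soylent Green</a> (1973) '
              '<a href="/en/film456.html">Silent Running</a> (1972)'}

# Inner expressions run first. The "second" one only matches on a film
# info page, so on this search page buffer 7 is never filled.
apply_regexp(buffers, 1, r'<img src="[^"]*/full/[0-9]*/([0-9]*)\.jpg">',
             r"\1", dest=7)
# The "fifth" one emits one <entity> per hit and appends it to buffer 5.
apply_regexp(buffers, 1,
             r'<a href="/en/film([0-9]*)\.html[^>]*>([^<]*)</a>[^(]*\(([0-9]*)',
             r"<entity><title>\2 (\3)</title><id>\1</id></entity>",
             dest=5, repeat=True, append=True)
# The outer ("first") expression runs last and wraps buffer 5.
apply_regexp(buffers, 5, "", r"<results>\1</results>", dest=8)
print(buffers[8])
```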

jelockwood Offline
Senior Member
Posts: 111
Joined: Mar 2008
Reputation: 0
Post: #17
Thank you, this reply has helped a lot. I can now see how part of the fifth expression finds the results and then outputs the title, url, etc. In the case of Amazon this needs a much longer and more complicated expression, as its page includes each result in several slightly different forms and we only want to list each once. Here is just one raw result section, in what is in my opinion the most useful form.

Code:
<a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/ref=sr_1_3?ie=UTF8&s=dvd&qid=1219010497&sr=1-3"><span class="srTitle">Soylent Green</span></a>
  
   ~ John Barclay, Whit Bissell, Jan Bradley,  and Chuck Connors <span class="bindingBlock">(<span class="binding">DVD</span> - 2007)</span></td></tr>

In this example B000VAHR0U is the ID number, which is used to form urls to access both artwork and individual DVD info, and the title is Soylent Green. Amazon's results page often does not list the film's real year, instead listing the year that edition of the DVD was issued, which makes the year useless for searching purposes. Would it be OK to simply ignore the year in the fifth expression and leave it out of the built title? That is, just returning "Soylent Green" rather than the useless "Soylent Green (2007)".

My first guess at an expression to search with for section 5 would be

Code:
<expression repeat="yes" noclean="1,2">www.amazon.com/[a-zA-Z0-9]*/dp/([B0-9]*)/ref=sr_[0-9]_[0-9]\?ie=UTF8&amp;s=dvd&amp;qid=[0-9]*&amp;sr=[0-9]-[0-9]&quot;&gt;&lt;span class=&quot;srTitle&quot;&gt;([a-zA-Z0-9]*)&lt;/span&gt;&lt;/a&gt;</expression>

Does this look correct to you? In particular the escaping of various characters? What is the correct way to escape a ? or a space (or is this not necessary)?

Note: It is possible for film titles to begin with a number or a letter, for example "2001 A Space Odyssey".
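
One way to check an expression like this outside XBMC is to undo the XML escaping and run it against the raw result. The sketch below does that with Python's re module, which is not the engine XBMC uses, so treat it as a sanity check only. The relaxed pattern is one possible fix written for this example, not a tested scraper expression. On the escaping question: in the regex itself a literal ? is written \? and a space needs no escaping, and neither character is special in XML.

```python
import re
from xml.sax.saxutils import unescape

sample = ('<a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/'
          'ref=sr_1_3?ie=UTF8&s=dvd&qid=1219010497&sr=1-3">'
          '<span class="srTitle">Soylent Green</span></a>')

# The expression as posted, still XML-escaped.
posted_xml = (r'www.amazon.com/[a-zA-Z0-9]*/dp/([B0-9]*)/ref=sr_[0-9]_[0-9]\?ie=UTF8'
              r'&amp;s=dvd&amp;qid=[0-9]*&amp;sr=[0-9]-[0-9]&quot;&gt;'
              r'&lt;span class=&quot;srTitle&quot;&gt;([a-zA-Z0-9]*)&lt;/span&gt;&lt;/a&gt;')
posted = unescape(posted_xml, {"&quot;": '"'})

# No match: [a-zA-Z0-9]* stops at the hyphens in the path, [B0-9]* cannot
# match the other letters in the ID, and the title class has no space.
posted_match = re.search(posted, sample)
print(posted_match)

# A relaxed variant: "anything but the delimiter" instead of listing
# every allowed character.
relaxed = (r'www\.amazon\.com/[^/]*/dp/([0-9A-Z]*)/ref=sr_[0-9]+_[0-9]+\?ie=UTF8'
           r'&s=dvd&qid=[0-9]*&sr=[0-9]+-[0-9]+">'
           r'<span class="srTitle">([^<]*)</span></a>')
relaxed_match = re.search(relaxed, sample)
print(relaxed_match.groups())
```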
Gaarv Offline
Junior Member
Posts: 35
Joined: May 2008
Reputation: 0
Post: #18
If executed, the regexp you gave won't work for the title, because it does not allow for the space.

Maybe you did already, but I highly suggest you read the URL the wiki points to: http://www.regular-expressions.info/repeat.html

It's also specified that laziness won't work, but it does to an extent, i.e.:

Code:
[0-9A-Za-z .!#".. whatever]*
can be simply replaced by
Code:
[^<>]*
meaning any repetition of characters except "<" and ">". Pretty handy as long as you have a matching pattern to end the repetition.

Below, an application with the string you provided:

Code:
www.amazon.com/[^&lt;&gt;/]*/dp/([0-9A-Z]*) etc

I'm not sure why you would want to obtain the title and year in one regex. If you make several, there's less chance of mistakes.
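
A small sketch of the "several small expressions" idea, using the negated character class from above. It is written in Python for convenience, the sample string is a shortened version of the Amazon result quoted earlier, and first_group is a hypothetical helper, not part of any scraper API.

```python
import re

sample = ('<a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/ref=sr_1_3">'
          '<span class="srTitle">Soylent Green</span></a> ~ John Barclay '
          '<span class="bindingBlock">(<span class="binding">DVD</span> - 2007)</span>')

# One small expression per field; each is easy to test on its own.
ID_RE = r'www\.amazon\.com/[^<>/]*/dp/([0-9A-Z]*)'
TITLE_RE = r'<span class="srTitle">([^<>]*)</span>'
YEAR_RE = r'<span class="binding">[^<>]*</span> - ([0-9]{4})\)'


def first_group(pattern, text):
    """Return the first captured group, or an empty string if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


film_id = first_group(ID_RE, sample)
title = first_group(TITLE_RE, sample)
year = first_group(YEAR_RE, sample)
print(film_id, title, year)
```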


spiff Offline
Grumpy Bastard Developer
Posts: 12,234
Joined: Nov 2003
Reputation: 82
Post: #19
laziness now works; we have switched the parser to pcre, so you can be as lazy as you want :)
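
For anyone unsure what laziness changes, here is a short illustration. It uses Python's re module rather than PCRE itself, on the assumption that the lazy quantifier syntax (.*?) behaves the same way for this simple case; the two-title sample string is made up.

```python
import re

html = ('<span class="srTitle">Soylent Green</span>'
        '<span class="srTitle">Silent Running</span>')

# Greedy: .* runs to the LAST </span>, swallowing both titles in one match.
greedy = re.findall(r'<span class="srTitle">(.*)</span>', html)

# Lazy: .*? stops at the FIRST </span>, giving one match per title.
lazy = re.findall(r'<span class="srTitle">(.*?)</span>', html)

print(greedy)
print(lazy)
```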

C-Quel Offline
Retired Team-XBMC Member
Posts: 1,378
Joined: Aug 2004
Reputation: 0
Post: #20
try this....

http://pastebin.com/m657636a8

old, but with cleanup + minor changes it should work

If scraper related please always grab the latest XML relevant to the content you are trying to grab info for from this link https://xbmc.svn.sourceforge.net/svnroot...m/scrapers
