A scraper can only be tested once it works almost completely, which is a bit of a catch-22.
Despite the very poor documentation, it seems clear enough that there are three main sections to a scraper:
1. Create the Search URL
2. Process the results to list the returned movies and let a user choose one
3. For the chosen movie, get the metadata fields
I am reasonably clear on CreateSearchUrl. However, the documentation for GetSearchResults sucks big time, and I now believe it contains at least one major error. So, having sat down and looked at two existing working scrapers (IMDB and FilmAffinity) and at what documentation there is in the Wiki, I am going to break down the steps as I understand them so far and ask for confirmation or correction of my understanding.
I will use the FilmAffinity GetSearchResults code as the example here. First, here is the exact code from that scraper.
Code:
<GetSearchResults dest="8">
    <RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
        <RegExp input="$$1" output="\1" dest="7">
            <expression><img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg"></expression>
        </RegExp>
        <RegExp dest="5" input="$$1" output="<entity><title>\1 (\2)</title><url>http://www.filmaffinity.com/en/film$$7.html</url><id>$$7</id></entity>">
            <expression noclean="1"><title>([^<]*)\(([0-9]*)\) - FilmAffinity</expression>
        </RegExp>
        <RegExp input="$$1" output="\1" dest="4">
            <expression noclean="1">(<b><a href="/en/film.*)</expression>
        </RegExp>
        <RegExp dest="5+" input="$$1" output="<entity><title>\2 (\3)</title><url>http://www.filmaffinity.com/en/film\1.html</url><id>\1</id></entity>">
            <expression repeat="yes" noclean="1,2"><a href="/en/film([0-9]*).html[^>]*>([^<]*)</a>[^\(]*\(([0-9]*)</expression>
        </RegExp>
        <expression noclean="1"></expression>
    </RegExp>
</GetSearchResults>
It seems fairly clear that the first RegExp line outputs the following
Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>
\1
</results>
where \1 is the content of variable 1
It is not clear what the second RegExp is doing, other than that it seems to be related to the unique ID number of the film on the website. Here is what its expression seems to translate to.
Code:
<img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg">
Interestingly, while this URL is valid and loads the film thumbnail, it does not seem to exist in the results returned by FilmAffinity.
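To see what that expression would capture if such a tag were present, here is a small sketch in Python (used only because it is easy to run, it is not scraper XML). The sample fragments and the ID 809297 are invented for illustration, and I am assuming the scraper's regex engine treats this pattern the way an ordinary regex library does.

```python
import re

# Expression from the second RegExp, copied as-is.
# Note the unescaped "." before "jpg" matches any single character.
EXPR = r'<img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg">'

# Hypothetical page fragment containing a poster image; the ID is made up.
sample = '<td><img src="http://www.filmaffinity.com/imgs/movies/full/80/809297.jpg"></td>'

match = re.search(EXPR, sample)
# If the tag is found, group 1 is the numeric part of the file name.
film_id = match.group(1) if match else ""
print(film_id)

# Hypothetical search-results fragment with no poster image: nothing matches.
no_match = re.search(EXPR, '<b><a href="/en/film809297.html">Example</a></b>')
print(no_match)
```

If that reading is right, the capture is the film's ID number, and nothing is captured on a page that has no such image tag.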
The third RegExp is returning
Code:
<entity>
<title>\1 (\2)</title>
<url>http://www.filmaffinity.com/en/film$$7.html</url>
<id>$$7</id>
</entity>
Here we see a case that seems to clearly point to an error in the Wiki. The Wiki suggests
Code:
<entity>
<title>?</title>
<url>?</url>
<url>?</url>
</entity>
and makes no mention of <id>?</id> at all. Note: both the IMDB and FilmAffinity scrapers do use the ID tags.
I would guess that <title>?</title> is the title of the film as returned by the website, <url>?</url> is the URL to access details for that selected film, and <id>?</id> is a unique ID number of that film as used by the website.
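To make that guess concrete, here is a rough Python sketch that builds a results document of the shape the two scrapers appear to produce and checks that it parses as XML. The helper name build_results and the film title, URL and ID are all hypothetical; the title/url/id layout is taken from the scraper code above, not from the Wiki.

```python
import xml.etree.ElementTree as ET

def build_results(films):
    """Build a <results> document from (title, url, film_id) tuples.

    Hypothetical helper: it mirrors the entity layout seen in the
    FilmAffinity scraper, which is an assumption about the expected format.
    """
    results = ET.Element("results")
    for title, url, film_id in films:
        entity = ET.SubElement(results, "entity")
        ET.SubElement(entity, "title").text = title
        ET.SubElement(entity, "url").text = url
        ET.SubElement(entity, "id").text = film_id
    return ET.tostring(results, encoding="unicode")

# Invented example film.
xml_text = build_results([
    ("Example Film (1999)", "http://www.filmaffinity.com/en/film123456.html", "123456"),
])

# Parse it back to confirm the document is well-formed.
parsed = ET.fromstring(xml_text)
tags = [child.tag for child in parsed[0]]
print(xml_text)
```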
I have no idea what purpose the fourth RegExp has. In this case it appears to return
Code:
(<b><a href="/en/film.*)
The fifth RegExp seems almost exactly the same as the third one.
Code:
<entity>
<title>\2 (\3)</title>
<url>http://www.filmaffinity.com/en/film\1.html</url>
<id>\1</id>
</entity>
So while it seems clear that the 3rd and 5th RegExp fill in the returned XML, it is still not clear to me what bit does the actual searching of the results to find the list of films returned by the website.
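My working guess is that the fifth expression itself does the searching: repeat="yes" would apply it to every match in the page, and dest="5+" would append each filled-in output to buffer 5. Here is a Python sketch of that guess (Python only so it can be run). The sample HTML, titles and IDs are invented, and the repeat/append behaviour is my assumption, not something the documentation states.

```python
import re

# Expression and output template from the fifth RegExp, copied as-is.
EXPR5 = r'<a href="/en/film([0-9]*).html[^>]*>([^<]*)</a>[^\(]*\(([0-9]*)'
TEMPLATE = (r'<entity><title>\2 (\3)</title>'
            r'<url>http://www.filmaffinity.com/en/film\1.html</url>'
            r'<id>\1</id></entity>')

# Hypothetical search-results fragment with two films.
sample = ('<b><a href="/en/film111111.html">First Example</a></b> (1999)<br>'
          '<b><a href="/en/film222222.html">Second Example</a></b> (2004)')

# Assumed meaning of repeat="yes" plus dest="5+": fill in the template
# once per match and append each result to the same buffer.
buffer5 = "".join(m.expand(TEMPLATE) for m in re.finditer(EXPR5, sample))

# Assumed meaning of the outer RegExp: wrap buffer 5 in <results>.
results = "<results>" + buffer5 + "</results>"
print(results)
```

If this is how it works, the list of films comes entirely from the repeated matches of that one expression, and the third RegExp would only matter when the site jumps straight to a single film's page.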
I would particularly like an explanation of what the second and fourth RegExp bits do.