[RELEASE] FilmAffinity (Spanish) scraper

pancheto Offline
Junior Member
Posts: 32
Joined: Nov 2011
Reputation: 0
Location: Santiago de Compostela
Post: #361
I'm good at regular expressions because of my job, so I'm considering getting into the scraper world to understand how the variables really work, which is what is stopping me from modifying this one.

The year option was the first thing I thought of when I opened the scraper code, but I assumed that \1 contained the whole file name and that the year could not be split from it, so I stopped there. If you can give me a hint on how to split the year from the name, I would try learning scraper coding and maybe contribute to this great scraper.
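As an illustration of the split being asked about, here is a minimal Python sketch (not scraper XML, and not how XBMC itself does it) that separates a title from a trailing "(year)" tag. The pattern, the function name, and the sample file names are assumptions made for the example.

```python
import re

# Assumed naming convention: "Title (YYYY).ext", with the year in
# parentheses right before the extension. Illustrative pattern only.
NAME_WITH_YEAR = re.compile(
    r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*(?:\.\w+)?$"
)


def split_title_and_year(file_name):
    """Return (title, year); year is None when no "(YYYY)" tag is found."""
    match = NAME_WITH_YEAR.match(file_name)
    if match:
        return match.group("title"), match.group("year")
    # No year tag: drop a trailing extension, if any, and keep the rest.
    return re.sub(r"\.\w+$", "", file_name), None


if __name__ == "__main__":
    print(split_title_and_year("Nixon (1995).avi"))
    print(split_title_and_year("Wolf Man.mkv"))
```

The two capture groups play the same role as a title buffer and a year buffer: once they are separate, the year can be sent to a search form on its own.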
pancheto Offline
Post: #362
After reviewing the code, I've come up with a couple of updates that improve the scraper's results. Since this is still an alpha version (I'm not properly into scraper coding yet), I will only post the changed lines here for anyone to test. The first one is the main change, and it works fantastically (it retrieves all of filmaffinity's info for ~700 file names without a single error); the other one is just a suggestion to bypass the Google search.

Line 11:
Code:
<RegExp input="$$1" output="&lt;url&gt;http://www.filmaffinity.com/es/advsearch.php?stype[]=title&fromyear=$$2&toyear=$$2&stext=\1&lt;/url&gt;" dest="3">
instead of
Code:
<RegExp input="$$1" output="&lt;url&gt;http://www.filmaffinity.com/es/search.php?stext=\1&amp;amp;stype=none&lt;/url&gt;" dest="3">
allows pointing to the exact movie if your file names are always tagged with "(year)" just before the file extension, as the XBMC library naming convention states.
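To make the shape of the new search URL easier to see, here is a small Python sketch that assembles it from a title and an optional year. The query parameters are the ones in the line above; the function name and the use of `quote_plus` for the title are assumptions, since the scraper engine does its own substitution of `\1` and `$$2`.

```python
from urllib.parse import quote_plus

ADVSEARCH = "http://www.filmaffinity.com/es/advsearch.php"


def build_search_url(title, year=None):
    """Build the advanced-search URL; an absent year leaves both bounds empty."""
    year_text = year or ""
    return (
        f"{ADVSEARCH}?stype[]=title"
        f"&fromyear={year_text}&toyear={year_text}"
        f"&stext={quote_plus(title)}"
    )


if __name__ == "__main__":
    print(build_search_url("Nixon", "1995"))
    print(build_search_url("Wolf Man"))
```

Setting `fromyear` and `toyear` to the same value narrows the search to a single year, which is what lets the result point to one exact movie.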

Lines 124 to 138:
Code:
            <RegExp input="$$9" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.imdb.com/search/title?year=$$6&title=$$9&lt;/url&gt;" dest="5+">
                <RegExp input="$$8" output="+\1" dest="9">
                    <RegExp input="$$1" output="\1" dest="8">
                        <expression>&lt;th&gt;T&amp;Iacute\;TULO ORIGINAL&lt;/th&gt;\s*&lt;td&gt;&lt;strong&gt;([a-z0-9\ ]*)</expression>
                    </RegExp>
                    <expression repeat="yes">([^ ,]+)</expression>
                </RegExp>
                <RegExp input="$$6" output="+\1" dest="7">
                    <RegExp input="$$1" output="\1" dest="6">
                        <expression>&lt;th&gt;A&Ntilde;O&lt;/th&gt;\s*&lt;td&gt;.*([0-9]{4}).*&lt;/td&gt;\s*&lt;/tr&gt;\s*&lt;tr&gt;\s*&lt;th&gt;DURACI&Oacute;N&lt;/th&gt;</expression>
                    </RegExp>
                    <expression>\s*([0-9]{4})\s*</expression>
                </RegExp>
                <expression />
            </RegExp>
instead of
Code:
            <RegExp input="$$9" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com\1&lt;/url&gt;" dest="5+">
                <RegExp input="$$8" output="+\1" dest="9">
                    <RegExp input="$$1" output="\1" dest="8">
                        <expression>&lt;th&gt;T&amp;Iacute\;TULO ORIGINAL&lt;/th&gt;\s*&lt;td&gt;&lt;strong&gt;([^&lt;]*)&lt;/strong&gt;&lt;/td&gt;</expression>
                    </RegExp>
                    <expression repeat="yes">([^ ,]+)</expression>
                </RegExp>
                <RegExp input="$$6" output="+\1" dest="9+">
                    <RegExp input="$$1" output="\1" dest="6">
                        <expression>&lt;th&gt;A&Ntilde;O&lt;/th&gt;\s*&lt;td&gt;(.*)&lt;/td&gt;\s*&lt;/tr&gt;\s*&lt;tr&gt;\s*&lt;th&gt;DURACI&Oacute;N&lt;/th&gt;</expression>
                    </RegExp>
                    <expression>\s*([0-9]{4})\s*</expression>
                </RegExp>
                <expression />
            </RegExp>
allows using IMDB's search engine directly, again with the ability to point to the exact year (and hence, ideally, to the exact movie). The drawback is that IMDB's advanced search is not as smart as Google's: by putting the emphasis on the year we lose flexibility in the title text (you won't get "wolfman" entries when searching for "wolf man" through IMDB). This loss of flexibility, plus the fact that I had to redirect the second RegExp's destination buffer (7 instead of 9+) only because I needed to leave buffer 9 untouched and didn't know a better way, should be discussed before this code goes into a beta version for testing. It would also be great to have a conditional that runs Google's search when IMDB's returns no results, so we could have the best of both worlds. Unfortunately I haven't mastered scraper coding that much!
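The "best of both worlds" conditional can be sketched in Python as follows. This only shows the control flow, under the assumption that each search is available as a function returning an IMDB id or `None`; `search_imdb` and `search_google` are hypothetical callables, not part of the scraper.

```python
def resolve_imdb_id(title, year, search_imdb, search_google):
    """Try the strict IMDB search first; fall back to Google when it finds nothing.

    search_imdb and search_google are hypothetical callables that take
    (title, year) and return an IMDB id string, or None when nothing matches.
    Returns a (result, source) pair so the caller can see which search answered.
    """
    imdb_id = search_imdb(title, year)
    if imdb_id:
        return imdb_id, "imdb"
    return search_google(title, year), "google"


if __name__ == "__main__":
    # Stub searches standing in for real HTTP lookups.
    strict = lambda title, year: None          # exact-title search finds nothing
    flexible = lambda title, year: "tt0000000"  # placeholder id, not a real lookup
    print(resolve_imdb_id("wolf man", "2010", strict, flexible))
```

With this ordering, Google is only queried for the titles the year-based search misses, which would also reduce the number of Google requests per library update.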

Note:
The search suggestion tries to bypass the Google search not because it performs badly (on the contrary, it performs much better than my suggestion), but because when Google detects a few hundred searches with the same structure coming from the same IP, it blocks the results page (after ~200 automatic searches it asks for a captcha). That makes updating an entire library of hundreds of files very complicated. I have tried replacing the http://www.google.com/ queries with http://www.google.es/ or even https://www.google.com/, but without luck.
(This post was last modified: 2011-11-21 03:10 by pancheto.)
pancheto Offline
Post: #363
Forget about my previous post: I finally worked out how to get around Google's search limitations, which were stopping me from batch updating my entire library. In summary, I suggest only 2 modifications to the current filmaffinity v1.4.1 scraper, both very simple yet very useful:
  1. Include the year in the filmaffinity search
    This allows a more refined search, which will point to the exact movie if your library file names follow the XBMC file naming convention (broadly, filename(year).ext). It will still search exactly as before if no year is given, so those who bothered to label their files properly get the bonus, and those who didn't lose nothing. Simply change the regular expression on line 11 to this:
    Code:
    <RegExp input="$$1" output="&lt;url&gt;http://www.filmaffinity.com/es/advsearch.php?stype[]=title&fromyear=$$2&toyear=$$2&stext=\1&lt;/url&gt;" dest="3">
  2. Remove the Google keyword "site:" from the IMDB id lookup
    This keyword was triggering some kind of special search tracking on Google's side, which produced a captcha request (undetectable from within XBMC) after a couple of hundred queries. I ended up changing "q=site:imdb.com\1" to "q=imdb\1" with no loss of performance at all. Simply change the regular expression on line 124 to this:
    Code:
    <RegExp input="$$9" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=imdb\1&btnI=745&pws=0&lt;/url&gt;" dest="5+">
    I added the "&btnI=745" parameter (forcing Google's "I'm Feeling Lucky" behaviour) in case it lightens the parsing of the results, and the "&pws=0" parameter (disabling personalised results) in case any locally stored setting could stop the scraper from reaching its destination. I'm pretty sure these 2 options are not necessary, but since they were present in all my tests and worked fine, I would suggest leaving them in as I did.
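For readers who want to see what the resulting Google request looks like, here is a minimal Python sketch that builds the same query string. The parameter names and values (`q`, `btnI=745`, `pws=0`) come from the line above; the function name and the way the title words and year are joined are assumptions made for illustration.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "http://www.google.com/search"


def build_google_query(title_words, year):
    """Build the Google URL used to find the IMDB page, without the "site:" keyword."""
    # A plain "imdb" keyword replaces "site:imdb.com" in the search terms.
    terms = " ".join(["imdb", *title_words, str(year)])
    params = {
        "q": terms,
        "btnI": "745",  # value taken from the post ("I'm Feeling Lucky")
        "pws": "0",     # value taken from the post (no personalised results)
    }
    return f"{GOOGLE_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_google_query(["Wolf", "Man"], 2010))
```

`urlencode` turns the spaces between terms into `+`, which matches the `+\1` pieces the scraper appends for each title word and for the year.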

I tested this modification with a library of ~700 file names and it worked like a charm, always pointing to the right filmaffinity movie and getting fanart for >90% of the movies.

Any future improvements? Sure, there are plenty, but the one that struck me as obvious is that some thumbnails were not being downloaded properly from filmaffinity. I can fetch them manually through XBMC, since it will suggest IMDB's ones, but I was wondering why movies like Nixon (http://www.filmaffinity.com/es/film737736.html) don't get a thumbnail. In fact there's no lightbox over them, so the img code probably looks different. I guess I'll leave this to the proper scraper developers to debug, and to release a new scraper version that includes my 2 suggestions above.
MaDDoGo Offline
Senior Member
Posts: 242
Joined: Sep 2009
Reputation: 1
Location: Sabadell (Barcelona)
Post: #364
Hi,

I looked at your modifications and (after some tests) I merged them into the repo, so the scraper now includes your improvements. Thanks for your time and for the modifications.

pancheto Offline
Post: #365
I have implemented a few improvements on filmaffinity poster searching at github, hoping that you'll find them useful.
(This post was last modified: 2011-11-24 02:34 by pancheto.)
itombs Offline
Senior Member
Posts: 142
Joined: Oct 2008
Reputation: 0
Post: #366
Hi, the number of votes has not been working for a few days.
I updated to MaDDoGo's github version, but the number of votes still doesn't work.
Please fix the number of votes.
Thanks a lot.
pancheto Offline
Post: #367
the scraper code looks for the number of votes between brackets, although now (filmaffinity is currently working on its look) it appears without them. I'll report this to MaDDoGo hoping to have it solved in the next version.
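A pattern that tolerates both layouts is one way to survive this kind of change. The Python sketch below accepts a vote count with or without surrounding brackets; it assumes the count has already been isolated from the page and that dots or commas are thousands separators, neither of which is confirmed here, and it is not the fix that was actually submitted.

```python
import re

# Optional "(" and ")" around a number such as "1.234" or "1234".
# Assumption: the text passed in is only the vote count snippet.
VOTES = re.compile(r"\s*\(?\s*(\d[\d.,]*)\s*\)?\s*")


def parse_votes(snippet):
    """Return the vote count as an int, or None if the snippet is not a number."""
    match = VOTES.fullmatch(snippet)
    if match is None:
        return None
    # Assumption: "." and "," are thousands separators, so they are dropped.
    return int(re.sub(r"[.,]", "", match.group(1)))


if __name__ == "__main__":
    print(parse_votes("(1.234)"))
    print(parse_votes("1.234"))
```

Making the brackets optional, rather than removing them from the pattern, keeps the expression working if the site switches back to the old layout.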
itombs Offline
Post: #368
When could the problem with the number of votes be fixed?
Is there any news about this?
pancheto Offline
Post: #369
The fix has already been submitted to XBMC's main repository, and you should see it on your system as version 1.4.3. If the addon doesn't update automatically, try updating it manually or downloading the code from MaDDoGo's github repository.
itombs Offline
Post: #370
pancheto Wrote: The fix has already been submitted to XBMC's main repository, and you should see it on your system as version 1.4.3. If the addon doesn't update automatically, try updating it manually or downloading the code from MaDDoGo's github repository.

Thanks a lot. Works well.
See you.