
pancheto · 2011-11-19, 11:08

I'm good at regular expressions because of my job, so I'm considering getting into the scraper world to understand how the variables really work, which is what is stopping me from modifying this one.

the year option was the first thing I thought of when I opened the scraper code, but since I assumed that \1 contained the whole file name and that the year could not be split from it, I stopped there. if you give me a hint on how to split the year from the name, I would try to learn scraper coding and maybe contribute to this great scraper.

pancheto · (This post was last modified: 2011-11-21, 03:10 by pancheto.)

after revising the code, I've come up with a few updates that improve the scraper's work. since it's still an alpha version (as I'm not really into proper scraper coding yet), I will only post the line changes here for anyone to test. the first one is the main change, which works fantastically (it is able to get all of filmaffinity's info from ~700 file names without any error), and the other one is just a suggestion to bypass google search.


line 11:

Code:
<RegExp input="$$1" output="&lt;url&gt;http://www.filmaffinity.com/es/advsearch.php?stype[]=title&fromyear=$$2&toyear=$$2&stext=\1&lt;/url&gt;" dest="3">

instead of

Code:
<RegExp input="$$1" output="&lt;url&gt;http://www.filmaffinity.com/es/search.php?stext=\1&amp;amp;stype=none&lt;/url&gt;" dest="3">

allows pointing to the exact movie if your file names are always tagged with "(year)" just before the file extension, as the xbmc library convention states.
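To make the shape of the new search URL easier to see outside the scraper engine, here is a minimal Python sketch. It is only an illustration: the function name `filmaffinity_search_url` is hypothetical, and it assumes that `\1` carries the title and `$$2` the year, as the line above suggests. Unlike the scraper line, `urlencode` percent-encodes the brackets in `stype[]`.

```python
from urllib.parse import urlencode


def filmaffinity_search_url(title, year=""):
    """Build an advsearch.php URL with the same parameters as the new line 11.

    An empty year leaves fromyear/toyear blank, mirroring an empty $$2.
    """
    params = [
        ("stype[]", "title"),  # search in titles only
        ("fromyear", year),    # lower bound of the year range
        ("toyear", year),      # upper bound, same year => exact year
        ("stext", title),      # the search text (\1 in the scraper)
    ]
    return "http://www.filmaffinity.com/es/advsearch.php?" + urlencode(params)


print(filmaffinity_search_url("Jersey Girl", "2004"))
```

Setting both `fromyear` and `toyear` to the same value is what narrows the result to a single year.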


lines 124 to 138:

Code:
            <RegExp input="$$9" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.imdb.com/search/title?year=$$6&title=$$9&lt;/url&gt;" dest="5+">
                <RegExp input="$$8" output="+\1" dest="9">
                    <RegExp input="$$1" output="\1" dest="8">
                        <expression>&lt;th&gt;T&amp;Iacute\;TULO ORIGINAL&lt;/th&gt;\s*&lt;td&gt;&lt;strong&gt;([a-z0-9\ ]*)</expression>
                    </RegExp>
                    <expression repeat="yes">([^ ,]+)</expression>
                </RegExp>
                <RegExp input="$$6" output="+\1" dest="7">
                    <RegExp input="$$1" output="\1" dest="6">
                        <expression>&lt;th&gt;A&Ntilde;O&lt;/th&gt;\s*&lt;td&gt;.*([0-9]{4}).*&lt;/td&gt;\s*&lt;/tr&gt;\s*&lt;tr&gt;\s*&lt;th&gt;DURACI&Oacute;N&lt;/th&gt;</expression>
                    </RegExp>
                    <expression>\s*([0-9]{4})\s*</expression>
                </RegExp>
                <expression />
            </RegExp>

instead of

Code:
            <RegExp input="$$9" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com\1&lt;/url&gt;" dest="5+">
                <RegExp input="$$8" output="+\1" dest="9">
                    <RegExp input="$$1" output="\1" dest="8">
                        <expression>&lt;th&gt;T&amp;Iacute\;TULO ORIGINAL&lt;/th&gt;\s*&lt;td&gt;&lt;strong&gt;([^&lt;]*)&lt;/strong&gt;&lt;/td&gt;</expression>
                    </RegExp>
                    <expression repeat="yes">([^ ,]+)</expression>
                </RegExp>
                <RegExp input="$$6" output="+\1" dest="9+">
                    <RegExp input="$$1" output="\1" dest="6">
                        <expression>&lt;th&gt;A&Ntilde;O&lt;/th&gt;\s*&lt;td&gt;(.*)&lt;/td&gt;\s*&lt;/tr&gt;\s*&lt;tr&gt;\s*&lt;th&gt;DURACI&Oacute;N&lt;/th&gt;</expression>
                    </RegExp>
                    <expression>\s*([0-9]{4})\s*</expression>
                </RegExp>
                <expression />
            </RegExp>

allows using IMDB's search engine directly, with the ability to point again to the exact year (hence ideally to the exact movie). the downside is that IMDB's advanced search is not as intelligent as Google's, and while the emphasis is put on the year we lose flexibility in the title text (you won't get "wolfman" entries when searching for "wolf man" through IMDB). this flexibility, plus the need to change the destination buffer of the second RegExp (7 instead of 9+) just because I needed to leave buffer 9 untouched (and didn't know how to), should be discussed before getting this code into a beta version to test. also, it would be great to have a conditional that runs google's search when imdb's returns no results, so we could have the best of both worlds. unfortunately I haven't mastered scraper coding that much!
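The "best of both worlds" fallback described above could look roughly like the following Python sketch. It is not scraper code: `resolve_imdb_id`, `strict_lookup` and `fuzzy_lookup` are hypothetical names, the two lookups are local stand-ins for the IMDB and Google searches, and the id is a placeholder rather than a real IMDB id.

```python
def resolve_imdb_id(title, year, search_imdb, search_google):
    """Try the year-restricted IMDB search first; fall back to Google."""
    imdb_id = search_imdb(title, year)
    if imdb_id is None:
        # IMDB returned no results, so use the more flexible search
        imdb_id = search_google(title, year)
    return imdb_id


def strict_lookup(title, year):
    # stand-in for IMDB's search: exact title and year only
    known = {("wolfman", 2010): "tt0000001"}  # placeholder id
    return known.get((title.lower(), year))


def fuzzy_lookup(title, year):
    # stand-in for Google: tolerant of spacing differences in the title
    return strict_lookup(title.replace(" ", ""), year)


print(resolve_imdb_id("wolf man", 2010, strict_lookup, fuzzy_lookup))
```

The strict lookup misses "wolf man", so the second lookup is consulted only in that case, which keeps the number of Google queries low.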

note:
the search suggestion tries to bypass google search not because it performs badly (on the contrary, it performs much better than my suggestion), but because when google detects a few hundred searches with the same structure coming from the same IP it blocks the results page (after ~200 automatic searches it asks for a captcha), which makes updating an entire library of hundreds of files very complicated. I have tried to get around this by sending the queries to http://www.google.es/ or even https://www.google.com/ instead of http://www.google.com/, but without luck.

pancheto · 2011-11-21, 17:19

forget about my previous post, as I finally worked out how to bypass google's search limitations, which were stopping me from batch updating my entire library. in summary, I suggest only 2 modifications to the current filmaffinity scraper (v1.4.1), both very simple yet very useful:

include year on filmaffinity search
this allows a more refined search, which will point to the exact movie if your library file names follow the XBMC file naming convention (broadly, filename(year).ext). it will still search exactly as before if the year is not given, so those who bothered to label their files properly get the bonus, and those who didn't simply won't. just change the regular expression on line 11 to this:

Code:
<RegExp input="$$1" output="&lt;url&gt;http://www.filmaffinity.com/es/advsearch.php?stype[]=title&fromyear=$$2&toyear=$$2&stext=\1&lt;/url&gt;" dest="3">
remove google keyword "site:" from IMDB id resolving
this was triggering some kind of special search tracking on google's side, which produced a captcha request (undetectable from within xbmc) after a couple of hundred queries. I ended up changing "q=site:imdb.com\1" to "q=imdb\1" and found no loss of performance at all. simply change the regular expression on line 124 to this:

Code:
<RegExp input="$$9" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=imdb\1&btnI=745&pws=0&lt;/url&gt;" dest="5+">
I added the "&btnI=745" parameter (forcing google's "I'm feeling lucky" feature) in case it could lighten the query parsing, and the "&pws=0" parameter (removing all local personal settings) in case anything stored locally in the browser could be stopping the scraper from reaching its destination. I'm pretty sure these 2 options are not necessary, but since they were present in all my tests and worked fine, I would suggest leaving them there as I did.

I tested this modification with a library of ~700 file names and it worked like a charm, always pointing to the right filmaffinity movie and getting >90% of the movies' fanart.

any future improvements? sure, there are plenty, but what struck me as obvious is that some thumbnails were not being downloaded properly from filmaffinity. I can download them manually through xbmc, as it will suggest imdb's ones, but I was wondering why movies like Nixon (http://www.filmaffinity.com/es/film737736.html) don't get a thumbnail. in fact there's no lightbox over them, so the img code probably looks different. I guess I'll leave this to the proper scraper developers to debug, and to release a new scraper version including my 2 previous suggestions.
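For anyone who wants to check what query the modified line ends up sending, here is a rough Python equivalent. It reflects my reading of the nested RegExp blocks (each word of the original title is prefixed with "+", then the year is appended), so treat it as a sketch; the function name `google_imdb_query` is hypothetical.

```python
import re


def google_imdb_query(original_title, year):
    """Approximate the URL built by the modified line 124."""
    # same pattern as the repeat="yes" expression: runs without spaces or commas
    words = re.findall(r"[^ ,]+", original_title)
    # each word becomes "+word", and the year is appended the same way
    terms = "".join("+" + word for word in words) + "+" + str(year)
    return "http://www.google.com/search?q=imdb" + terms + "&btnI=745&pws=0"


print(google_imdb_query("Jersey Girl", 2004))
```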

MaDDoGo · 2011-11-22, 22:58

Hi,

I looked at your modifications and (after agjacome's tests) I merged them into the repo, so the scraper is enhanced with your modifications. Thanks for your time and the modifications.

pancheto · (This post was last modified: 2011-11-24, 02:34 by pancheto.)

I have implemented a few improvements to filmaffinity poster searching on github, hoping that you'll find them useful.

itombs · 2011-11-25, 23:23

Hi, the number of votes has not been working for a few days.
I updated to MaDDoGo's github version, but the number of votes still does not work.
Please fix the number of votes.
Thanks a lot.

pancheto · 2011-11-26, 01:32

the scraper code looks for the number of votes between brackets, but now (filmaffinity is currently reworking its look) it appears without them. I'll report this to MaDDoGo, hoping to have it solved in the next version.
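One way to make such a pattern tolerant of both layouts is to make the brackets optional. The Python sketch below only illustrates that idea: it assumes "brackets" means parentheses, the sample strings are made up rather than taken from filmaffinity's pages, and `parse_votes` is a hypothetical name.

```python
import re

# optional "(" and ")" around a number that may use . or , as separators
VOTES = re.compile(r"\(?\s*([0-9][0-9.,]*)\s*\)?")


def parse_votes(fragment):
    """Return the vote count from a text fragment, or None if no number is found."""
    match = VOTES.search(fragment)
    if match is None:
        return None
    # drop thousands separators before converting to an integer
    return int(re.sub(r"[.,]", "", match.group(1)))


print(parse_votes("(12.345)"), parse_votes("12.345"))
```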

itombs · 2011-12-01, 19:02

When could the problem with the number of votes be fixed?
Is there any news about this?

pancheto · 2011-12-01, 21:40

the fix has already been submitted to XBMC's main repository, and you should see it updated on your system as version 1.4.3. if the addon doesn't get updated automatically, try updating it manually or downloading the code from MaDDoGo's github repository.

itombs · 2011-12-02, 01:43

pancheto Wrote:the fix has been already submitted to XBMC's main repository, and you should see it updated on your system as version 1.4.3. if the addon doesn't get updated automatically try doing it manually, or downloading the code from MaDDoGo's github repository.

Thanks a lot. Works well.
See you.

serieofilo · 2012-01-01, 20:15

Hello,

I've been using the FA scraper for some time and I've found the following inconsistency when downloading information for movies with English/Spanish names in parentheses, but only when the movie is not a video file on the disk but a pointer to a DVD disc.

The problem is that the title has extra blank spaces between the movie name and the english/spanish name between () but only if the movie is a .disc file (a pointer for an external DVD disc).

For example, this pointer to a DVD file is getting incorrect information, 3 spaces between Jersey and the (:

Una chica de Jersey (Jersey Girl) (2004).dvd.disc

Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<movie>
    <title>Una chica de Jersey   (Jersey Girl)</title>
    <originaltitle>Jersey Girl</originaltitle>

This one is getting good information, only 1 space between the name and the (:

The Reader (El lector) (2008).avi

Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<movie>
    <title>The Reader (El lector)</title>
    <originaltitle>The Reader</originaltitle>

Any idea about what's the problem?

Thank you.

pancheto · 2012-01-02, 19:09

I've tried to replicate your issue, and although I've found some interesting things I haven't found any problem with the scraper, since the search results don't depend at all on the filename or its extension, but on the code of the film's filmaffinity entry itself.

the problem with Jersey Girl in particular, regardless of its file type and filename, is that its filmaffinity film entry does indeed have 3 spaces in its title:

Code:
<title>Una chica de Jersey   (Jersey Girl) (2004) - FilmAffinity</title>

the good news is that the scraper works fine and that you're getting the right information stored in XBMC's database, so I would consider this a minor bug rather than an error. but if I understand correctly, you are suggesting correcting the wrong filmaffinity data on the fly while it is being parsed. since I don't know how to do that (it looks really simple, although I'm still new to scraper coding), I'll report this issue on github in case another coder is able to include it in a future scraper version.
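The on-the-fly correction itself is a small whitespace clean-up. This Python sketch shows the idea outside the scraper engine, using the title from this thread; `normalize_title` is a hypothetical name and this is not the fix that ended up in the scraper.

```python
import re


def normalize_title(title):
    """Collapse any run of two or more whitespace characters into one space."""
    return re.sub(r"\s{2,}", " ", title).strip()


print(normalize_title("Una chica de Jersey   (Jersey Girl)"))
```

Titles that already use single spaces, such as "The Reader (El lector)", pass through unchanged.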

pancheto · 2012-01-11, 19:53

I've just addressed this problem (the extra spaces coming from FilmAffinity) on github. I'm sure that it'll soon be committed to the master branch, and then to the official XBMC repository.

tonybeccar · 2012-03-28, 06:48

Hello, I've been using the FA scraper ever since I've had XBMC, and one feature that I now see available in the imdb scraper is the only thing that IMO this script is missing! The imdb script has the option to scrape the movie title based on a predefined country. This is really useful for me, and I assume for many others, because if a person doesn't live in Spain some titles may be confusing, which results in renaming lots of movies by hand.

So, I'm asking, would it be possible to include this feature in the FA scraper? Maybe a copy paste of the IMDB scraper?

Thanks in advance!!

pancheto · 2012-03-28, 09:52

the main idea behind the FA scraper is indeed to search FA. although the scraper tries to enrich those search results with information from other sources (such as IMDB), an entry on FA can only be located by its spanish title (logical, since this is a spanish community) or by its original title. I know agjacome is working on a way to store the original title in the XBMC db instead of the spanish one, but I don't think searching by titles in other languages should be the aim of the FA scraper.