So I got tired of the existing scrapers returning incorrect results for about a third of my movies. It turns out that while the sites we scrape have lots of great data, their search engines range from inaccurate (IMDb's) to completely broken (TMDb's). They both choke when your naming doesn't exactly match the official name of the movie, and they really don't like foreign movies.
Any real search engine can handle these cases just fine. Looking around, I found that Bing has a very nice, easy-to-use developer API for accessing their search results. Google and Yahoo both also have APIs, but they are only for use as part of an AJAX website (Google's FAQ says they'll block you if you scrape their results). The Bing ToU allows use in an "end-user-facing website or application".
Anyway, I edited the existing IMDB scraper to do a Bing search of "site:imdb.com movie (year)" and parse the returned XML. The actual data is still scraped from IMDB; I just changed the search part. For my collection of ~200 movies (Hollywood, anime, foreign, etc.) Bing got the correct IMDB link 100% of the time.
This method could be used for any scraper by replacing "site:imdb.com" with "site:themoviedb.org" or something else.
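A minimal sketch of the query-building step, written in Python rather than scraper XML. The function name `build_site_query` and its parameters are hypothetical and only for illustration; the only thing taken from the post is the "site:imdb.com movie (year)" query shape. It does not contact Bing, it just produces the query text and a URL-encoded form of it.

```python
from urllib.parse import quote_plus


def build_site_query(title, year=None, site="imdb.com"):
    """Return a search query restricted to a single site.

    Hypothetical helper: mirrors the "site:imdb.com movie (year)" shape.
    """
    parts = ["site:" + site, title.strip()]
    if year is not None:
        # The year goes in parentheses, as in the query described above.
        parts.append("(%d)" % year)
    return " ".join(parts)


query = build_site_query("Akira", 1988)
print(query)              # plain query text
print(quote_plus(query))  # URL-encoded, ready to append to a search URL
```

Switching scrapers is then just a different `site` argument, for example `build_site_query("Akira", 1988, site="themoviedb.org")`.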
Question for an XBMC admin: Bing's API requires an AppID, just like TMDB's. I signed up for one personally but would rather not release the scraper using my AppID. Would it be possible for someone @xbmc.org to sign up for an official AppID that can be used? It's an online process that takes about 5 minutes.
Once the AppID is squared away I'll release my bing_imdb.xml file and post a little guide on how to add Bing to other scrapers.
rufus210
Junior Member Posts: 13 Joined: Jan 2010 Reputation: 0
pike
Project Manager Joined: Sep 2003 Reputation: 28 Location: Sweden
2010-01-06 16:00
Post: #2
I made this appid: 16E50AB9947899C41433EB944C60174737855036
Always read the XBMC online-manual, FAQ and search the forum before posting. Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules. For troubleshooting and bug reporting please make sure you read this first.
takoi
Fan Posts: 506 Joined: Oct 2009 Reputation: 6 Location: Norway
2010-01-07 15:05
Post: #3
You don't need an API to scrape, you know. If I'm not mistaken, the IMDB scraper did use Google before. I've tested Google and it's not perfect: there are pages missing and it can have problems with AKA titles. Whether Google is better than IMDB all depends on which movies you have, I guess. Maybe you should try this before completely ruling out IMDB.
And btw, Google will only block you for a small amount of time if you do something like 500 searches in under a minute.
(This post was last modified: 2010-01-07 15:09 by takoi.)
rufus210
Junior Member Posts: 13 Joined: Jan 2010 Reputation: 0
2010-01-08 07:10
Post: #4
Thanks for the AppID Pike. Here are the 2 files for my modification of the imdb scraper. Drop them into the system/scrapers/video directory of your install.
http://www.hackish.org/~rufus/bing_imdb.xml
http://www.hackish.org/~rufus/bing_imdb.png

Can people try it out and report how it works? It works perfectly for me.

To use this for other scrapers, simply replace the CreateSearchUrl and GetSearchResults parts with the ones from my file. In CreateSearchUrl, replace site:imdb.com with site:somethingelse.com. In GetSearchResults, replace "http://www.imdb.com/title/(tt[0-9]+)/?" with the full URL you're expecting.
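To illustrate what the GetSearchResults pattern is doing, here is a rough Python sketch that applies the same expression to a block of result text. The helper name `extract_ids` and the `<Url>` sample markup are assumptions made up for this example, not actual Bing output and not code from bing_imdb.xml. The dots in the pattern are escaped here, which the quoted expression does not do.

```python
import re

# Pattern quoted in the post (dots escaped); swap it for another site's URL shape.
IMDB_TITLE_URL = re.compile(r"http://www\.imdb\.com/title/(tt[0-9]+)/?")


def extract_ids(results_text, pattern=IMDB_TITLE_URL):
    """Return captured IDs in order of first appearance, without duplicates."""
    seen = []
    for match in pattern.finditer(results_text):
        movie_id = match.group(1)
        if movie_id not in seen:
            seen.append(movie_id)
    return seen


# Made-up stand-in for a search response; the IDs are placeholders.
sample = (
    "<Url>http://www.imdb.com/title/tt0000001/</Url>"
    "<Url>http://www.imdb.com/title/tt0000002/trivia</Url>"
    "<Url>http://www.imdb.com/title/tt0000001/</Url>"
)
print(extract_ids(sample))
```

Keeping the first-appearance order matters because the search engine's ranking is the whole point of this approach; the first ID is the one the search considered the best hit.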
rufus210
Junior Member Posts: 13 Joined: Jan 2010 Reputation: 0
2010-01-08 07:17
Post: #5
Also I agree that turning on sorted="yes" would vastly improve the default IMDB scraper's results. IMDB's search is doing far more intelligent sorting than XBMC's basic string comparison.
jmarshall
Team-XBMC Developer Posts: 24,523 Joined: Oct 2003 Reputation: 138
2010-01-09 00:51
Post: #6
I've said this about 100 times.
Change the imdb scraper to return ALL titles that the search brings, rather than just the links. The problem is that it's not returning the AKA title names that the search page gives.

eg Infernal Affairs
Should include both "The Departed" and "Infernal Affairs" results, both linking to the same movie.

That way the fuzzy matching in XBMC will work perfectly, _enhancing_ the IMDb results for those cases where it doesn't work well.

Can you find any other cases where the IMDb search would then require sorted="yes"?

Cheers, Jonathan
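A small Python sketch of why returning AKA titles helps a fuzzy matcher. This is not XBMC's matching code: `difflib.SequenceMatcher` is used here as a stand-in, and `best_match`, the "Infernal Dreams" entry, and the `id-1`/`id-2` identifiers are all hypothetical.

```python
from difflib import SequenceMatcher


def best_match(query, results):
    """Pick the (title, movie_id) pair whose title is closest to the query."""
    def score(pair):
        # Case-insensitive similarity ratio between query and result title.
        return SequenceMatcher(None, query.lower(), pair[0].lower()).ratio()
    return max(results, key=score)


# Hypothetical search results: "id-1" is the wanted movie.
primary_only = [("The Departed", "id-1"), ("Infernal Dreams", "id-2")]
# Same results, plus the AKA title pointing at the same movie.
with_aka = primary_only + [("Infernal Affairs", "id-1")]

print(best_match("Infernal Affairs", primary_only))  # similar-looking wrong title wins
print(best_match("Infernal Affairs", with_aka))      # AKA entry matches exactly
```

With only the primary titles, the matcher has nothing resembling the query for the right movie; once the AKA name is listed as its own result with the same link, a plain string comparison is enough.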
snoop2048
Junior Member Posts: 22 Joined: Dec 2009 Reputation: 0
2010-01-10 17:29
Post: #7
Hi,
What changes do I need to make to get IMDB to return all titles?
I am having this issue, as it's not returning the AKA titles.
Cheers
Si
jmarshall
Team-XBMC Developer Posts: 24,523 Joined: Oct 2003 Reputation: 138
2010-01-10 22:56
Post: #8
You need to modify the scraper XML file. This involves parsing the results IMDb gives using regular expressions to generate a set of XML results that XBMC then uses.
Cheers, Jonathan
rufus210
Junior Member Posts: 13 Joined: Jan 2010 Reputation: 0
2010-01-11 04:53
Post: #9
So I did some quick tests using my Anime collection. Here are the 13 movies I used for testing:
Code:
Akira-1988

My Bing+IMDB scraper gets all 13/13 correct. The default IMDB scraper (unsorted) gets only 7/13. The ones it gets wrong:

Code:
Fullmetal.Alchemist_the.Conqueror.of.Shamballa-2005

Using the default IMDB scraper and adding the sorted="yes" tag gets 12/13. It can't find this one at all (doesn't even get a wrong hit):

Code:
Ghost.in.the.Shell_Innocence-2004

jmarshall: why bother adding a bunch more code to scrape AKA names from IMDB results? Is there any case where XBMC's re-ordering of names provides better results than IMDB's sort order?
jmarshall
Team-XBMC Developer Posts: 24,523 Joined: Oct 2003 Reputation: 138
2010-01-11 06:03
Post: #10
Because IMDb may hit the "popular" result first, even though the actual result may be further down?