Using Bing for better scraper searching

rufus210 Offline
Junior Member
Posts: 13
Joined: Jan 2010
Reputation: 0
Post: #11
That's always possible, but I've yet to find a case where XBMC's resorting is (or with aka names could be) better than using IMDB's sorting order.

In the case of Infernal Affairs (which was mentioned earlier) both The Departed and Mou gaan dou list it as an AKA name, so re-sorting with AKA names could give you either one of the two. Going with the default IMDB sort gives you The Departed, which is still wrong. My scraper using Bing gives the correct result (Mou gaan dou).

Another issue I just discovered: IMDB's search ignores years. Searching for Oceans Eleven (1960) gives the same results as searching for Oceans Eleven (2001). My Bing-based search gives you the correct one in each case.

The summary from all of my testing so far:
Existing IMDB scraper: lots of issues, never gets the best results
IMDB scraper with sorted="yes": fixes most of the issues, but not all of them
Bing+IMDB scraper: I have yet to find a case where it was not correct
jmarshall Offline
Team-XBMC Developer
Posts: 24,570
Joined: Oct 2003
Reputation: 138
Post: #12
Nope, the correct result for this movie was in fact The Departed :P

XBMC's sorting checks the year in addition to the title, so +1 from there as well.
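
As a rough illustration of why checking the year helps, here is a minimal Python sketch. It is not XBMC's actual sorting code: the `score` function, the `year_bonus` weight and the use of `difflib` for title similarity are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def score(query_title, query_year, cand_title, cand_year, year_bonus=0.25):
    """Title similarity in [0, 1], plus a bonus when the years agree.

    year_bonus is an arbitrary illustrative weight, not a value taken from XBMC.
    """
    sim = SequenceMatcher(None, query_title.lower(), cand_title.lower()).ratio()
    if query_year is not None and cand_year == query_year:
        sim += year_bonus
    return sim


# Two candidates whose titles are identical, so only the year can separate them.
candidates = [
    ("Ocean's Eleven", 2001),
    ("Ocean's Eleven", 1960),
]

best = max(candidates, key=lambda c: score("Oceans Eleven", 1960, c[0], c[1]))
print(best)
```

With identical titles the similarity term is the same for both candidates, so the year bonus alone decides the winner.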

I doubt that improving the IMDb scraper via my suggested route would work better than <arbitrary search engine>, though I'm just guessing there. The main advantage is not relying on said search engine.

Cheers,
Jonathan

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


snoop2048 Offline
Junior Member
Posts: 22
Joined: Dec 2009
Reputation: 0
Post: #13
rufus210,

Cheers mate - the scraper you have provided has vastly improved my matches!

100% thus far. Good work.
takoi Offline
Fan
Posts: 511
Joined: Oct 2009
Reputation: 6
Location: Norway
Post: #14
i don't know how you think this is supposed to be done. the only way i can see is to list movies for each aka title, and honestly, i don't see how you can justify that.

btw, the first one i tried, "Dream 2008", failed, while unsorted imdb succeeds. you need to test more than 17 movies, rufus. (will test the rest of mine later)
(This post was last modified: 2010-01-11 17:59 by takoi.)
spiff Offline
Grumpy Bastard Developer
Posts: 12,234
Joined: Nov 2003
Reputation: 82
Post: #15
you search on imdb.

it tosses up a list of movies, including aka titles. we grab all the titles (something we don't do now). the sorting does its magic. voilà.
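
The flow described above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not scraper code: the `rank_with_akas` helper, the `difflib` similarity measure and the movie ids are hypothetical placeholders. It also reproduces the Infernal Affairs tie raised earlier in the thread.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_with_akas(query, movies):
    """Score every title (main and AKA) and keep the best score per movie id.

    Returns a list of (score, id, matched_title), best score first.
    """
    best = {}
    for movie in movies:
        for title in [movie["title"]] + movie["akas"]:
            s = similarity(query, title)
            if movie["id"] not in best or s > best[movie["id"]][0]:
                best[movie["id"]] = (s, title)
    ranked = [(s, movie_id, title) for movie_id, (s, title) in best.items()]
    # sort() is stable, so equal scores keep the order the search returned them in.
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked


# Placeholder ids; both movies list the same AKA title, as described in the thread.
movies = [
    {"id": "tt-departed", "title": "The Departed", "akas": ["Infernal Affairs"]},
    {"id": "tt-mou-gaan-dou", "title": "Mou gaan dou", "akas": ["Infernal Affairs"]},
]

ranked = rank_with_akas("Infernal Affairs", movies)
for entry in ranked:
    print(entry)
```

Both movies match the query exactly through their AKA title, so title sorting alone cannot pick between them and the result falls back to the order of the search results. A tie-breaker such as the year would be needed to separate them.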

takoi Offline
Fan
Posts: 511
Joined: Oct 2009
Reputation: 6
Location: Norway
Post: #16
ok then.
gave it a shot but i can't figure it out. i'm all out of ideas now and just don't see how this is possible. since none of you have written it already, i'm assuming you don't know either?
spiff Offline
Grumpy Bastard Developer
Posts: 12,234
Joined: Nov 2003
Reputation: 82
Post: #17
rather, i haven't had time to look into it - it should be very much doable.

takoi Offline
Fan
Posts: 511
Joined: Oct 2009
Reputation: 6
Location: Norway
Post: #18
Parsing id, year and title/aka-title in one expression; not doable.
Storing id and year, then parse titles; doable but not repeatable.
Using custom functions; not possible

now whats left?
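
For reference, the output the second option is aiming for (store id and year once per result, then emit one entry per title or AKA title) looks like this when written as an ordinary loop in Python. This only shows the intended result, not how to express it in scraper XML. The sample markup, the ids and the `parse_entities` helper are hypothetical, loosely modelled on the patterns discussed in this thread.

```python
import re

# Hypothetical search-result markup with placeholder ids.
SAMPLE = (
    '<tr><td><a href="/title/tt0000001/">Mou gaan dou</a> (2002)'
    '<br>aka <em>Infernal Affairs</em></td></tr>'
    '<tr><td><a href="/title/tt0000002/">The Departed</a> (2006)'
    '<br>aka <em>Infernal Affairs</em></td></tr>'
)

ROW = re.compile(r'<tr>(.*?)</tr>', re.S)
MAIN = re.compile(r'<a href="/title/(tt[0-9]+)/[^>]*>([^<]*)</a> *\(([0-9]{4})')
AKA = re.compile(r'aka\s+<em>([^<>]+)</em>')


def parse_entities(html):
    """Return one dict per (movie, title) pair, AKA titles included."""
    entities = []
    for row in ROW.findall(html):
        main = MAIN.search(row)
        if main is None:
            continue
        # First pass: id, main title and year are captured once per result row.
        movie_id, title, year = main.groups()
        # Second pass: every title of that row reuses the stored id and year.
        for t in [title] + AKA.findall(row):
            entities.append({"id": movie_id, "year": year, "title": t})
    return entities


entities = parse_entities(SAMPLE)
for e in entities:
    print(e)
```

The inner loop is the part that needs to repeat per result row, which is the step described above as hard to express in a single scraper expression.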
jmarshall Offline
Team-XBMC Developer
Posts: 24,570
Joined: Oct 2003
Reputation: 138
Post: #19
First off, post the code you've actually done so that others don't have to do anything from scratch.

takoi Offline
Fan
Posts: 511
Joined: Oct 2009
Reputation: 6
Location: Norway
Post: #20
Code:
<RegExp input="$$8" output="\1" dest="7">
    <RegExp input="$$7" output="&lt;year&gt;\3&lt;/year&gt;&lt;url&gt;http://akas.imdb.com/title/\1/&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;" dest="9">
        <expression noclean="1">&gt;&lt;a href=&quot;/title/([t0-9]*)/[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt; *\(([0-9]*)(.*?)&gt;[0-9]?\.&lt;</expression>
    </RegExp>
    <RegExp input="$$7" output="&lt;entity&gt;&lt;title&gt;\4&lt;/title&gt;$$9&lt;/entity&gt;" dest="8">
        <expression repeat="yes" noclean="1">(&gt;&lt;a href=&quot;/title/[t0-9]*/[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt;)|(aka\s+&lt;em&gt;([^&lt;&gt;]+)&lt;/em&gt;)</expression>
    </RegExp>
    <expression clear="yes" noclean="1"/>
</RegExp>
fine, but i don't see what good it does. the algorithm is the problem here, not the code.