Clean scraping API

garbear Offline
Team-Kodi Developer
Posts: 746
Joined: Dec 2010
Reputation: 37
Location: gangsta's paradise
Post: #46
(2013-04-02 14:25)malte Wrote:  1. artwork download: I have seen that you removed the artwork download in latest commits. Do you want to remove it completely from heimdall or do you just plan to move it to another module?
I took your artwork downloaders to the chop shop Smile (the rest of the tgdb scraper work was the perfect groundwork). They don't belong in a scraper module, because they aren't particular to a single scraper. Also, as they "supply" a file instead of data, heimdall sees an empty supply[], which is a conceptual problem. For example, topfs2 earlier mentioned sql-like queries, and if we want to optimize for small amounts of data extracted then backwards-chaining can be added to heimdall, which relies on conclusions being present.

Conclusion-less tasks aren't bad, per se; in fact, they might be a useful tool in heimdall's data-driven programming environment. I feel that they should be discouraged, however, so that Heimdall can focus purely on enhancing its metadata processing algorithms.

malte Wrote:2. platform detection: It looks like you do platform detection via file extension...
One of the principles behind heimdall is that it only runs tasks when necessary. So topfs2 brings up sticking in a demand = [demands.none(game.platform)] line; that way the client can specify the platform and cause that task to be skipped.

malte Wrote:3. title matching: In RCB I added some more logic to this part of the program to get better scrape results. E.g. trying to find 100% matches as automatically as possible (replacing metadata in [] and (), handling sequel numbers with digits or Roman numerals, ...) and also offer an interactive mode where the user gets a list of matches and can select the correct item manually. Is this a feature that you might consider as part of heimdall?
The best approach here is probably similar to my platform comparison algorithm: canonicalize the titles first (translate II to 2, brothers to bros, etc. like RCB does currently) and then run a fuzzy string comparison on the canonicalized titles.
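
A minimal sketch of the canonicalize-then-compare idea, assuming Python's standard difflib for the fuzzy step. The substitution tables and the function names (canonicalize, similarity, best_match) are hypothetical and are not taken from heimdall or RCB:

```python
import re
from difflib import SequenceMatcher

# Hypothetical substitution tables; a real scraper would need larger ones.
ROMAN = {"ii": "2", "iii": "3", "iv": "4"}
ABBREV = {"brothers": "bros"}


def canonicalize(title):
    """Lower-case, drop [..] and (..) tags, normalize numerals and abbreviations."""
    title = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", title.lower())
    words = re.findall(r"[a-z0-9]+", title)
    return " ".join(ABBREV.get(w, ROMAN.get(w, w)) for w in words)


def similarity(a, b):
    """Fuzzy ratio in [0, 1] between two canonicalized titles."""
    return SequenceMatcher(None, canonicalize(a), canonicalize(b)).ratio()


def best_match(query, candidates):
    """Return the (candidate, score) pair most similar to the query."""
    return max(((c, similarity(query, c)) for c in candidates),
               key=lambda pair: pair[1])


print(canonicalize("Super Mario Brothers II (USA) [!]"))
```

A plain ratio like this still favors short candidates when the query is a truncated title, so it is only a starting point for the matching step.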

EDIT: by "best", I meant "readily thought of at 2am" Smile A bayesian filter and some basic discrimination training should make an interactive mode almost unnecessary.

(2013-04-02 15:03)topfs2 Wrote:  Indeed, this will always be a big problem. ATM The SearchTasks only returns a url (without any other data). This is due to https://github.com/topfs2/heimdall/issues/7 and if they could return more complex items they could return title and certainty percentage. That way RCB for example could only choose games with more than a hitrate of X %. (This is how current scraper engine in xbmc does IIRC).
Being able to retrieve a name, year, thumbnail and possibly other info alongside the URL would be helpful for the client so that it can solve possible disambiguation problems in the way that the client best sees fit.

Malte, have you started to work on other game scrapers yet? I'm planning one that parses ROMs directly for available embedded data.
(This post was last modified: 2013-04-03 06:38 by garbear.)
N3MIS15 Offline
Donor
Posts: 505
Joined: Jul 2010
Reputation: 13
Location: Melbourne, VIC
Post: #47
(2013-04-02 16:56)garbear Wrote:  The best approach here is probably similar to my platform comparison algorithm, canonicalize the titles first (translate II to 2, brothers to bros, etc. like RCB does currently) and then run a fuzzy string comparison on the canonicalized titles.

+1 on this. With the thegamesdb module I was able to "trick" the title retrieval by using the name "Wonderboy III.sms", and it returned "Wonderboy" (the first in the series). Giving the full/correct name "Wonderboy III The Dragons Trap.sms" returned the correct result. Alternative titles/platforms may also need to be taken into account as, for example, the platform "Playstation" is also commonly named "PSX" or "PS1".
On top of that, regions may also need to be considered. Mario Bros. 2 (JP) is a totally different game than Mario Bros. 2 (US)/(EU).
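
A small sketch of how platform aliases and region tags could be normalized before matching. The alias table, the region list and both function names are assumptions for illustration only:

```python
import re

# Hypothetical alias table mapping common spellings to one canonical name.
PLATFORM_ALIASES = {
    "playstation": "Playstation",
    "psx": "Playstation",
    "ps1": "Playstation",
}

# Only the three regions mentioned in the post are recognized here.
REGION_TAG = re.compile(r"\((JP|US|EU)\)", re.IGNORECASE)


def canonical_platform(name):
    """Return the canonical platform name, or the input unchanged if unknown."""
    return PLATFORM_ALIASES.get(name.strip().lower(), name)


def split_region(title):
    """Split 'Mario Bros. 2 (JP)' into ('Mario Bros. 2', 'JP'); region is None if absent."""
    match = REGION_TAG.search(title)
    if match is None:
        return title.strip(), None
    bare = (title[:match.start()] + title[match.end():]).strip()
    return bare, match.group(1).upper()


print(canonical_platform("PSX"), split_region("Mario Bros. 2 (JP)"))
```

Keeping the region as a separate value lets a client treat the (JP) and (US) releases as different games even though the bare titles are identical.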

garbear Offline
Team-Kodi Developer
Posts: 746
Joined: Dec 2010
Reputation: 37
Location: gangsta's paradise
Post: #48
(2013-04-02 14:25)malte Wrote:  2. platform detection:
It looks like you do platform detection via file extension. I am afraid this will always be error-prone at least when you have to deal with multi platform extensions like .img or .bin. Any plans how to solve this?
Just had some beers with a buddy and went into techno-babble mode about heimdall's AI aspects Smile He's not a programmer, so I was explaining to him that you start with a fact (such as "the movie's filename is /home/Avatar.mkv") and try to answer questions (what is the movie's format? what is the movie's genre?). You have tasks which operate on the subject, and their purpose is to generate new facts, assuming that their required premises are true.

I realized that the theory behind its operation is, word for word, that of an inference engine. Subject tasks are treated like inference rules (If P, then Q: "If subject has filename, then title is filename minus extension" or "If subject has title, then genre is <movie["genre"] from api.tmdb.org/title>"). Heimdall schedules the rule when its premises are fulfilled, and the outcome is the possible establishment of its conclusions. Starting with a fact or two, heimdall applies rules to infer new facts about the subject, adding each new fact to its knowledge base, and the cool thing is once a fact has been established it can be used to infer other facts as well (forward-chaining).
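
To make the forward-chaining idea concrete, here is a toy sketch, not heimdall's actual code. The rule tuples, the fact names and forward_chain are hypothetical, and the network-backed genre rule is left out so the example stays self-contained:

```python
import os

# Each rule is (premises, conclusion, function). All names are illustrative.
RULES = [
    ({"filename"}, "title",
     lambda f: os.path.splitext(os.path.basename(f["filename"]))[0]),
    ({"filename"}, "format",
     lambda f: os.path.splitext(f["filename"])[1].lstrip(".")),
    ({"title", "format"}, "label",
     lambda f: "%s [%s]" % (f["title"], f["format"])),
]


def forward_chain(facts, rules=RULES):
    """Fire every rule whose premises are known until no new fact appears."""
    facts = dict(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion, fn in rules:
            if conclusion not in facts and premises.issubset(facts):
                facts[conclusion] = fn(facts)
                changed = True
    return facts


print(forward_chain({"filename": "/home/Avatar.mkv"}))
```

The third rule only fires after the first two have established their conclusions, which is the chaining behavior described above.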

As far as future scraper devs go, I feel that reinforcing the inference engine metaphor is a double-edged sword. On the downside, it's a layer of complexity and frustration (AI isn't the most approachable subject). On the other hand, it gracefully handles a lot of other questions that are going to be asked. It helped me out when my buddy asked me a question about rule priorities: specifically, how does an inference engine choose which rule is "best", and like malte is wondering, which platform detection algorithm is best?

My buddy can't code but he likes to eat, so I gave an example of an inference engine that inferred the color of a fruit you were eating. If you started with the rule "If it's a lemon, then it's yellow" and you're eating a lemon, the engine will infer your fruit is yellow. How do you attach a priority to the statement "If it's a lemon, then it's yellow"? There's no concept of rule priority here - either the rule is true, or it isn't. If it's not true, it doesn't belong in the rule base, simple as that. This highlights two important points. The first is that Heimdall's rules must not introduce inconsistent information. (In a normal inference engine, you might have a consistency enforcer attempting to maintain a consistent representation of the emerging solution, usually using timestamps of derived facts or the Occam's razor principle.) For example, if you're eating a banana, you can't add a rule that says "If it's a banana, then it's yellow" because it might be green. Similarly, it's valid to have the rule "If it's a .nes file then it's a NES game" but not "If it's a .bin file then it's a PSX game".

The second important point I'll make is: don't ignore data. Let's say a scraper dev comes along and adds the following rules: "If it's unripe, it's green", "If it's ripe, it's yellow". Neither type nor taste alone is sufficient, but with both you can unambiguously resolve color for lemons and bananas. As long as we don't violate the consistency requirement, rule bases are never-ending pots that we can keep teaching new ideas to.

Outside of a class from prof. M. Dyer, my AI experience is pretty limited. Topfs2, you're the one with the master's degree (though I wouldn't be surprised if prof Dyer co-authored a course book Smile). Task priorities and concerns over data consistency naturally arise from a task-based paradigm, so I'd like to hear your thoughts on reinforcing the inference engine metaphor.

(2013-04-03 07:12)N3MIS15 Wrote:  +1 on this. With the thegamesdb module I was able to "trick" the title retrieval by using the name "Wonderboy III.sms", and it returned "Wonderboy" (the first in the series). Giving the full/correct name "Wonderboy III The Dragons Trap.sms" returned the correct result. Alternative titles/platforms may also need to be taken into account as, for example, the platform "Playstation" is also commonly named "PSX" or "PS1". On top of that, regions may also need to be considered. Mario Bros. 2 (JP) is a totally different game than Mario Bros. 2 (US)/(EU).
I'll give this more thought, but a bayesian filter should make an interactive mode virtually unnecessary.
malte Online
Skilled Python Coder
Posts: 1,351
Joined: Jan 2010
Reputation: 28
Location: Germany
Post: #49
garbear Wrote:I took your artwork downloaders to the chop shop. They don't belong in a scraper module, because they aren't particular to a single scraper.
I see Smile. I had in mind to move them to a separate module when I would start to implement other scrapers. But I can understand if you don't want them to be a part of heimdall at all. Maybe I will re-add them in a RCB specific version later (as you might have noticed I am not an elegant coder in the first place). From a pragmatic point of view downloading artwork is the most time consuming task in the whole scraping process and doing it multithreaded saves some valuable seconds per game. So before I start writing up some crappy threading thing in RCB I prefer to reuse what Tobias has already implemented.

garbear Wrote:EDIT: by best, I meant readily thought of at 2am a bayesian filter and some basic discrimination training should make an interactive mode almost unnecessary.
Not yet convinced that this may work but I am ready to be surprised (and to test). If it helps: I have created kind of a testset some years ago to test some of the more problematic games with RCB. Maybe I am able to dig it out or to recreate it.

garbear Wrote:
topfs2 Wrote:Indeed, this will always be a big problem. ATM The SearchTasks only returns a url (without any other data). This is due to https://github.com/topfs2/heimdall/issues/7 and if they could return more complex items they could return title and certainty percentage. That way RCB for example could only choose games with more than a hitrate of X %. (This is how current scraper engine in xbmc does IIRC).
Being able to retrieve a name, year, thumbnail and possibly other info alongside the URL would be helpful for the client so that it can solve possible disambiguation problems in the way that the client best sees fit.
Not sure if I understand that right. Will the client get one result and decide if it is the correct one, or will it be a two-step approach where the client gets a list of results and starts a second run with the correct item?

garbear Wrote:Malte, have you started to work on other game scrapers yet? I'm planning one that parses ROMs directly for available embedded data.
Not yet. But I wanted to reimplement all scrapers that RCB currently supports and maybe add one or two that are currently not supported. This is the list I have in mind:

Online scrapers
- thegamesdb
- archive.vg (xml API)
- giantbomb (xml API)
- mobygames (html scraping)
- some alternative to MAWS to scrape Arcade games
- maybe GameFAQ (see question below)

Offline scrapers:
- nfo files
- emuxtras desc files
- maybe other desc files that are provided for MAME or other rom sets

But I guess some of these sources may just remain RCB specific and are not relevant to heimdall.

As mobygames is one of the most complete available sources atm I think I would continue with this one. If you want to start this yourself just let me know and I will sit still. Otherwise I could do the groundwork again and you could chime in where I sucked. If I would give it a try I would start with BeautifulSoup. Any better ideas?

Another question about implementing scrapers: As heimdall might become a more official part of XBMC, how do you handle scraping permissions for the scraped sites? For example, I got official permission to scrape data from thegamesdb, archive.vg and giantbomb. Mobygames did not respond to my request but they do not explicitly forbid scraping in their terms of use, so I decided it will be ok to implement it. No idea if XBMC can go the same route or if we have to ask again. GameFAQ also did not respond but they explicitly forbid automated scraping, so I decided not to implement it. Maybe they will respond if XBMC asks officially for scraping permission.

I also want to get as much information as possible from the scrapers. E.g. thegamesdb also provides information and artwork for consoles and publishers/developers. Do you plan to use this in heimdall/RetroPlayer too?
garbear Offline
Team-Kodi Developer
Posts: 746
Joined: Dec 2010
Reputation: 37
Location: gangsta's paradise
Post: #50
(2013-04-03 12:56)malte Wrote:  I see Smile. I had in mind to move them to a separate module when I would start to implement other scrapers. But I can understand if you don't want them to be a part of heimdall at all. Maybe I will re-add them in a RCB specific version later (as you might have noticed I am not an elegant coder in the first place). From a pragmatic point of view downloading artwork is the most time consuming task in the whole scraping process and doing it multithreaded saves some valuable seconds per game. So before I start writing up some crappy threading thing in RCB I prefer to reuse what Tobias has already implemented.
In my last post I mentioned one of the theories that a pragmatic point of view might overlook. And your pragmatism didn't go unnoticed Smile Artwork downloaders might not belong in a metadata extractor, but the dataflow paradigm in heimdall is perfect for executing a large number of heavy tasks in parallel as their data becomes available. I think that when tobias makes heimdall an xbmc library, you can import the package and supply your own artwork downloader module. That way the download locations and configuration, etc. can be hidden from heimdall.

malte Wrote:Not yet convinced that this may work but I am ready to be surprised (and to test). If it helps: I have created kind of a testset some years ago to test some of the more problematic games with RCB. Maybe I am able to dig it out or to recreate it.
Cool, I'd like to see the test. As with any sort of natural-language search, domain knowledge and machine learning will probably be the key to optimality. The last filter I wrote somehow got dumber as it went on... not a good sign Smile

malte Wrote:Not sure if I understand that right. Will the client get one result and decides if it is the correct one or will it be a two step approach where the client gets a list of results and starts a second run with the correct item?
Probably a callback function that takes a list of results and returns the correct/chosen one, or an error code to abort the current task.
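
A rough sketch of what such a callback could look like. The result dictionaries, the "certainty" key and the use of None as the abort signal are all assumptions made for this example, not an agreed heimdall interface:

```python
def scrape_with_choice(results, choose):
    """Pass candidate results to the client's callback and return its pick.

    `results` is a list of dicts; `choose` returns one of them, or None to
    abort the current task. Both shapes are assumptions of this sketch.
    """
    if not results:
        return None
    chosen = choose(results)
    if chosen is not None and chosen not in results:
        raise ValueError("callback must return one of the offered results")
    return chosen


def pick_highest_certainty(results, threshold=0.8):
    """Example client policy: accept the top hit only if it is certain enough."""
    best = max(results, key=lambda r: r["certainty"])
    return best if best["certainty"] >= threshold else None


candidates = [
    {"title": "Wonderboy", "certainty": 0.55},
    {"title": "Wonderboy III The Dragons Trap", "certainty": 0.90},
]
print(scrape_with_choice(candidates, pick_highest_certainty))
```

An interactive client could pass a callback that shows the list to the user instead, without the scraper needing to know the difference.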

malte Wrote:As mobygames is one of the most complete available sources atm I think I would continue with this one. If you want to start this yourself just let me know and I will sit still. Otherwise I could do the groundwork again and you could chime in where I sucked. If I would give it a try I would start with BeautifulSoup. Any better ideas?
I'm of the opinion that the dude with the 10,000 line python program knows what he's doing Smile And I'm pretty wrapped up in my filter right now. If you push to a development branch, I'll probably check in after a few days to make sure Heimdall is being worshiped properly Wink

malte Wrote:Another question about implementing scrapers: As heimdall might be more official part of XBMC, how do you handle scraping permissions for the scraped sites? For example, I got official permission to scrape data from thegamesdb, archive.vg and giantbomb. Mobygames did not respond to my request but they do not explicitly forbid scraping in their terms of use, so I decided it will be ok to implement it. No idea if XBMC can go the same route or if we have to ask again. GameFAQ did also not respond but they explicitly forbid automated scraping so I decided not to implement it. Maybe they will respond if XBMC asks officially for scraping permission.
For now, Heimdall and XBMC are separate, so TOS negotiations are handled separately.

malte Wrote:I also want to get as much information as possible from the scrapers. E.g. thegamesdb also provides information and artwork for consoles and publishers/developers. Do you plan to use this in heimdall/RetroPlayer too?
The more the merrier Smile I think you would target consoles or publishers as subjects in Heimdall, like container.audio.Artist and container.audio.Album classes.
topfs2 Offline
Team-Kodi Developer
Posts: 4,063
Joined: Dec 2007
Reputation: 10
Post: #51
(2013-04-03 11:27)garbear Wrote:  Just had some beers with a buddy and went into techno-babel mode about heimdall's AI aspects Smile He's not a programmer, so I was explaining to him that you started with a fact (such as "the movie's filename is /home/Avatar.mkv") and try to answer questions (what is the movie's format? what is the movie's genre?). You have tasks which operate on the subject, and their purpose is to generate new facts, assuming that their required premises are true.

This might be the best written explanation of what I tried to achieve with heimdall Smile This is exactly how I had in mind that heimdall will find the scraping pipeline: essentially by stumbling across the end result by inference.

The thing which breaks this is when rules polish an inference, for example title. There will be several tasks which will alter that title (polishing it and making it nicer). So there is a small sense of priority in heimdall to accommodate that, i.e. if a rule depends on title (like tmdb search), don't run it until all rules which infer title have done so.

e.g.
File URL -> title.
File URL -> MediaInfo
MediaInfo -> Audio/Video
Audio/Video -> Duration
Duration -> Movie/TV Show
Movie/TV Show -> (polished) title

title -> tmdb.

So in this case we can't infer title -> tmdb until Movie/TV Show -> (polished) title has run.

So the pipeline becomes:

File URL -> title, Media Info -> Audio/Video -> Duration -> Movie/TV Show -> (polished) title -> tmdb

This is mostly to make it more approachable. It would be very possible to do this by pure inference by actually naming the property polished title. Then tmdb couldn't be run until that is inferred.
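
The ordering constraint can be sketched with a toy scheduler. This is an illustration under stated assumptions, not heimdall's scheduler: the (name, demands, supplies) tuples, the "kind" property standing in for Movie/TV Show, and schedule() are all hypothetical.

```python
# (name, demands, supplies). tmdb_search is listed first on purpose, to show
# that it is held back until every task that can still supply title has run.
TASKS = [
    ("tmdb_search", {"title"}, {"tmdb"}),
    ("guess_title", {"file_url"}, {"title"}),
    ("media_info", {"file_url"}, {"media_info"}),
    ("classify_stream", {"media_info"}, {"audio_video"}),
    ("duration", {"audio_video"}, {"duration"}),
    ("movie_or_tv", {"duration"}, {"kind"}),
    ("polish_title", {"kind", "title"}, {"title"}),
]


def schedule(initial, tasks=TASKS):
    """Return task names in an order that respects demands and polishing."""
    known = set(initial)
    pending = list(tasks)
    order = []
    progress = True
    while pending and progress:
        progress = False
        for task in pending:
            name, demands, supplies = task
            # Other pending tasks that could still alter something we demand.
            blockers = [t for t in pending if t is not task and t[2] & demands]
            if demands <= known and not blockers:
                order.append(name)
                known |= supplies
                pending.remove(task)
                progress = True
                break  # restart the scan; the pending list just changed
    return order


print(schedule({"file_url"}))
```

A task that can never run would block its dependents forever here; a real scheduler would need to detect that case.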

If you have problems please read this before posting

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.

[Image: badge.gif]

"Well Im gonna download the code and look at it a bit but I'm certainly not a really good C/C++ programer but I'd help as much as I can, I mostly write in C#."
natethomas Offline
Team Kodi Community Manager
Posts: 3,716
Joined: Apr 2008
Reputation: 63
Location: Kansas
Post: #52
This all makes me think Heimdall might be a major solution to the age old "separate scrapers for movies and tv" problem. It seems like it shouldn't be all that impossible to put together a series of inference rules that would quickly and easily solve for the problem, so long as the person developing the rules was aware of the major outliers. For example, "If the video is shorter than 60 minutes, it is either a tv show or a short film." "If the video uses a TV show numbering scheme, it is a TV show." "If the video is longer than 60 minutes, it is either a tv show or a regular film." "If the video uses a TV show numbering scheme (e.g. Sherlock), it is a TV show."

Man, it'd be pretty sweet to make Heimdall the default scraper engine across all of XBMC, and then just tell people to point XBMC at ALL of their media (movies, tv, music videos, music, games, etc.) and let XBMC/Heimdall do the rest without any input from the user.

I may not have been very enthusiastic about Heimdall before, topfs2, but if I'm understanding this right, that is incredibly, incredibly awesome.

topfs2 Offline
Team-Kodi Developer
Posts: 4,063
Joined: Dec 2007
Reputation: 10
Post: #53
(2013-04-04 10:23)natethomas Wrote:  This all makes me think Heimdall might be a major solution to the age old "separate scrapers for movies and tv" problem. It seems like it shouldn't be all that impossible to put together a series of inference rules that would quickly and easily solve for the problem, so long as the person developing the rules was aware of the major outliers. For example, "If the video is shorter than 60 minutes, it is either a tv show or a short film." "If the video uses a TV show numbering scheme, it is a TV show." "If the video is longer than 60 minutes, it is either a tv show or a regular film." "If the video uses a TV show numbering scheme (e.g. Sherlock), it is a TV show."

Indeed, and part of what you suggest is actually how it's guessing if it's a movie or tv show right now! "If it's a video and is longer than 60 minutes it's a movie". https://github.com/topfs2/heimdall/blob/...tem.py#L21

(2013-04-04 10:23)natethomas Wrote:  Man, it'd be pretty sweet to make Heimdall the default scraper engine across all of XBMC, and then just tell people to point XBMC at ALL of their media (movies, tv, music videos, music, games, etc.) and let XBMC/Heimdall do the rest without any input from the user.

I may not have been very enthusiastic about Heimdall before, topfs2, but if I'm understanding this right, that is incredibly, incredibly awesome.

Haha, that might have been due to me doing a very very bad presentation at devcon Smile Me almost fainting due to being hungry and not knowing how to explain well to begin with isn't exactly a great recipe to get people excited by it Tongue

natethomas Offline
Team Kodi Community Manager
Posts: 3,716
Joined: Apr 2008
Reputation: 63
Location: Kansas
Post: #54
(2013-04-04 10:56)topfs2 Wrote:  Indeed, and part of what you suggest is actually how its guessing if its a movie or tv show right now! "If its a video and is longer than 60 minutes its a movie". https://github.com/topfs2/heimdall/blob/...tem.py#L21

That part could be tricky for very specific British dramas. I think there are a few shows that run a good hour and a half that would technically be considered TV shows, like Sherlock and probably various Dr. Who Christmas Specials (Never watch Dr. Who, so I can't say for certain on that one).

da-anda Offline
Team-Kodi Member
Posts: 3,118
Joined: Jun 2009
Reputation: 39
Location: germany
Post: #55
yes, it's not good to consider anything > 60 min as movie. The first episode of a TV show is often longer than 60 min.
topfs2 Offline
Team-Kodi Developer
Posts: 4,063
Joined: Dec 2007
Reputation: 10
Post: #56
Nah, it won't be enough. It was mostly a trial. And as garbear stated, you can just as easily propose a file that you state is a movie to start with (which is what we do currently in xbmc), and then the inferring done by that guessing rule is void.

The first part of GSoC I spent gathering data to do the guessing in a nice way, and I have a few GB of documents that I could use as a basis for a real AI to guess well.

Bstrdsmkr Offline
Posting Freak
Posts: 803
Joined: Oct 2010
Reputation: 17
Post: #57
(2013-04-05 11:25)da-anda Wrote:  yes, it's not good to consider anything > 60 min as movie. The first episode of a TV show is often longer than 60 min.

Getting back to what garbear said though, never throw any data away =)
Your statement assumes that this is the only criterion for determining whether it's a movie or a TV show. This may not be enough by itself, but it gives you a starting point.

Put another way, there are two basic ways of inferring something:
1. Hard facts - These are indisputable/trusted pieces of information such as "This file is 151mb" and are pretty straightforward
2. Process of elimination - These are based on indicators but are not in and of themselves desired information. An example of a collection of criteria:
   1. This file is in the movies folder, so it's more likely that it's a movie than a TV show
   2. This file's size is greater than 1gb, so it's more likely that it's a movie than a TV show
   3. We parsed the file name and found a match for the regex [sS]\d{1,3}, so it has a season marker and is more likely to be a TV show than a movie

Now take each of those criteria and add 1 point for every one that indicates a movie, then subtract a point for every one that indicates a TV show. If the final score is positive, we're looking at a movie. If it's negative, we're looking at a TV show. In this case, two criteria point to a movie and one to a TV show, so we end with a score of +1 out of a possible +/-3, or about 33% confident. From there, we could decide that we trust the user's placement in the file system more than the other factors and assign it a value of 2 instead of 1. Now our final score would be +2 out of a possible +/-4, or 50% confident.

At that point, it's worth risking wasting a call to imdb.com to say "do you have any movies with this title?" but probably not worth risking wasting a call to tvdb.com to say "Do you have a TV show with this title and at least this many seasons?"
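
The tally can be sketched as follows. Counting the three listed criteria literally (two for movie, one for TV show) gives a net +1 out of 3. The folder name, the 1 GiB threshold and both function names are assumptions for illustration:

```python
import re

SEASON_MARKER = re.compile(r"[sS]\d{1,3}")


def votes_for(path, size_bytes, folder_weight=1):
    """Collect signed votes: positive means 'movie', negative means 'TV show'."""
    votes = []
    if "/movies/" in path.lower():          # hypothetical folder convention
        votes.append(+folder_weight)
    if size_bytes > 1024 ** 3:              # larger than 1 GiB
        votes.append(+1)
    if SEASON_MARKER.search(path.rsplit("/", 1)[-1]):  # file name only
        votes.append(-1)
    return votes


def confidence(votes):
    """Return (net score, |score| / maximum possible score)."""
    score = sum(votes)
    possible = sum(abs(v) for v in votes)
    return score, (abs(score) / possible if possible else 0.0)


path = "/media/movies/Some.Show.S01E02.mkv"
print(confidence(votes_for(path, 2 * 1024 ** 3)))
print(confidence(votes_for(path, 2 * 1024 ** 3, folder_weight=2)))
```

Doubling the folder weight moves the same file from a net +1 of 3 to a net +2 of 4.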
garbear Offline
Team-Kodi Developer
Posts: 746
Joined: Dec 2010
Reputation: 37
Location: gangsta's paradise
Post: #58
(2013-04-05 09:58)natethomas Wrote:  That part could be tricky for very specific British dramas. I think there are a few shows that run a good hour and half that would technically be considered TV shows, like Sherlock and probably various Dr. Who Christmas Specials (Never watch Dr. Who, so I can't say for certain on that one).

I think the solution's quite clear: we're looking for a yes-or-no oracle function that simply doesn't exist. It's a good example of what I said above, "you can't add a rule that says "If it's a banana, then it's yellow" because it might be green". I'll be surprised if we ever see a rule where time is a major deciding factor in the movie vs. tv show decision.

HOWEVER, if we're considering this, we would need to be relatively certain of our assertion. The way to do this requires knowing P(movie|time), the probability that our file is a movie given the runtime. That's hard to calculate directly, but with bayes' theorem, this isn't:
Code:
P(movie|time) = P(time|movie) * P(movie) / [ P(time|movie)*P(movie) + P(time|tv)*P(tv) ]
where P(movie) and P(tv) are the percentage of movies and TV shows in XBMC libraries (I'd use 50/50 to be safe), and P(time|movie) and P(time|tv) are the file runtime fit to a distribution of average movie/tv show times.

For example, assume 1000 movies from tmdb and tv shows from tvdb were sampled, and we decided a normal distribution was appropriate, with movies averaging 120 minutes (σ = 40 mins) and tv shows averaging 30 minutes (σ = 10 mins). If the runtime is 30 minutes, then P(time|tv) = 1.0 and P(time|movie) = 1 - erf(|t - µ| / (σ * sqrt(2))) = 1 - erf(|30m - 120m| / (40m * sqrt(2))) ~= 2.44%. P(movie) and P(tv) are both 0.5, so the probability that it's a movie given a 30 minute runtime is 0.0244 * 0.5 / (0.0244 * 0.5 + 1.0 * 0.5) = 2.38%.
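
The worked example can be reproduced with Python's standard math.erf. The function names are hypothetical, and the means and sigmas are the illustrative values from the post, not measured data:

```python
from math import erf, sqrt


def tail_likelihood(t, mean, sigma):
    """Two-sided tail probability of a normal(mean, sigma) at runtime t.

    This is the 1 - erf(|t - mean| / (sigma * sqrt(2))) term from the post,
    used there as a stand-in for P(time | class).
    """
    return 1.0 - erf(abs(t - mean) / (sigma * sqrt(2)))


def p_movie_given_time(t, p_movie=0.5, movie=(120.0, 40.0), tv=(30.0, 10.0)):
    """Bayes' theorem for two classes (movie vs. TV show)."""
    p_tv = 1.0 - p_movie
    like_movie = tail_likelihood(t, *movie)
    like_tv = tail_likelihood(t, *tv)
    evidence = like_movie * p_movie + like_tv * p_tv
    if evidence == 0.0:
        return p_movie  # neither model explains t; fall back to the prior
    return like_movie * p_movie / evidence


print("P(time|movie) = %.4f" % tail_likelihood(30, 120, 40))
print("P(movie|time) = %.4f" % p_movie_given_time(30))
```

Swapping in a different prior only requires changing p_movie, so the 50/50 assumption is easy to revisit.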

So, runtime-based movie/tv show detection is still on the table; we just have to agree on a safe degree of belief. And like I said earlier, "don't throw away data" - like inference rules, bayes' theorem lets us chain together rules that build a degree of belief on a base of evidence. Being able to propagate these calculations forward alongside inference rules would allow for some powerful combinations of conditional probability.
(This post was last modified: 2013-04-06 03:13 by garbear.)
jmarshall Offline
Team-XBMC Developer
Posts: 26,230
Joined: Oct 2003
Reputation: 177
Post: #59
Give it up for the Reverend Bayes...

topfs2 Offline
Team-Kodi Developer
Posts: 4,063
Joined: Dec 2007
Reputation: 10
Post: #60
I've done an initial analysis of the data gathered during GSoC and I just wanted to add some calculated values to the example above.

Code:
P(Episode    | Video) = 0.843797662418
P(Movie      | Video) = 0.154526991252
P(MusicVideo | Video) = 0.00167534633044

These numbers I got simply from looking at the posted length of the data, e.g
Code:
len(Episode)     = 640650
len(Movie)       = 117324
len(MusicVideo)  = 1272
len(Video)       = 233118
len(TotalVideos) = len(Episode) + len(Movie) + len(MusicVideo)

Where len(Video) are videos not in the database (unscraped for a reason or simply missed)
I skipped len(Video) since I want the probability to add up to 1.

Code:
P(Episode|Video) = len(Episode) / len(TotalVideos)
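
The three probabilities above follow directly from the posted counts. A short sketch that reproduces them (the dictionary layout and the priors function are just for illustration):

```python
# Counts as posted above; unscraped videos (len(Video)) are left out so the
# probabilities add up to 1.
COUNTS = {"Episode": 640650, "Movie": 117324, "MusicVideo": 1272}


def priors(counts):
    """Return each category's share of the total number of scraped videos."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


for kind, p in priors(COUNTS).items():
    print("P(%s | Video) = %.12f" % (kind, p))
```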

I also checked the runtimes of movies (with 10k movies), where I think it could be valid to assume a normal distribution with µ=105.624426079 σ=22.9860217337
[Image: 6vcc2AJ.png?1]

I have noticed that the movie database seems to contain some tv shows as well, which might be what is causing it to shift slightly to the left.

I want to add a disclaimer on this since it's just a very quick, initial analysis, and it has been a long time since I did statistics math Smile
Randomly selected movies/episodes from tmdb and tvdb might also be a better indicator of the runtime distribution.

Cheers,
Tobias

(This post was last modified: 2013-04-10 08:42 by topfs2.)