TMDb to introduce movie hashing search

jmarshall Offline
Team-XBMC Developer
Posts: 24,564
Joined: Oct 2003
Reputation: 138
Post: #11
We're still discussing this internally, so note that this is my opinion only.

I can see why this idea may be useful for the subtitles people, as the subs people download are designed for a particular encode of the movie, and may not work with other encodes of that same movie (timing issues).

However, I see no real benefit in the case of looking up metadata in terms of improving efficacy:

1. It's fairly clear that the primary beneficiaries are those who obtain "scene" releases, which carries implications of copyright infringement and, in turn, possible privacy implications.
2. All such releases already have .nfo files with the download (else the "scene" rejects them), which I believe already have the imdb id in them, thus from XBMC's perspective, the gains appear minimal.
3. Clearly the hash is computed from the encoding, not from the content, so it's not particularly useful for identifying an item anyway (there will be multiple hashes for each movie). The IMDb ID (or TMDb ID, for that matter) appears better for this purpose, as it is unique to the movie, not the encoding.

Cheers,
Jonathan

zag Offline
Team-XBMC Member
Posts: 848
Joined: Oct 2007
Reputation: 9
Location: UK
Post: #12
jmarshall Wrote: 1. It's fairly clear that the primary beneficiaries are those who obtain "scene" releases, which brings implications of copyright infringement, and in turn possible privacy implications.
Well, the hash of a film is not copyrighted, so I don't see the problem there. Also, as legitimate movie downloads take off, this feature could help with those as well. It already helps with full ISO rips of DVDs you own, which should have the same hash. Privacy is a fair point, which is why the transfer should always remain anonymous.

jmarshall Wrote: 2. All such releases already have .nfo files with the download (else the "scene" rejects them), which I believe already have the imdb id in them, thus from XBMC's perspective, the gains appear minimal.
Very true, but it's another file that would not be needed if this hashing method does take off. Again, this method won't suit everyone, so there will always need to be alternatives.

jmarshall Wrote: 3. Clearly the 'hash' is computed based on the encoding, not based on content, so it's not particularly useful for identifying an item anyway (multiple hashes for each movie). The IMDb id (or tmdb id for that matter) appears better for this purpose, as it is unique to the movie, not the encoding.
I think as the multiple hashes build up over time, you will be surprised how many movies are recognized. It's the power of the community that will make or break this new feature, I guess.
(This post was last modified: 2009-09-18 14:30 by zag.)
walts81 Offline
Junior Member
Posts: 1
Joined: Sep 2009
Reputation: 0
Post: #13
These hashes are "bit-for-bit", which means that even the slightest difference between two files will result in a different hash. This would include "legit" downloads, as each purchase would result in a different DRM-wrapped file. In my opinion the hashing is absolutely useless in all cases except for torrent downloaders. I personally rip all my own DVDs to ISOs, stripping out all the extra stuff (menus, special features, etc.) and leaving just the main title. Hashing those would be virtually useless, as others would have to do the exact same thing. And even then, I'm not so sure it would generate the exact same file. I ripped a DVD a long time ago and then re-ripped it just the other day the exact same way using the same app, and the hashes were different.

- Josh
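
The "bit-for-bit" point in the post above can be illustrated with a minimal sketch. It assumes a digest computed over the whole file content (MD5 here, chosen only for illustration); the hash actually proposed in this thread may read only parts of the file, so this shows the general behaviour rather than that specific scheme. The two byte strings stand in for two rips of the same disc that differ in a single bit.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the MD5 hex digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


# Two stand-in "files": identical except for one flipped bit.
original = bytes(1024)
changed = bytearray(original)
changed[512] ^= 0x01  # flip a single bit in the middle
modified = bytes(changed)

print(file_digest(original))
print(file_digest(modified))
print("same hash:", file_digest(original) == file_digest(modified))
```

Any lookup keyed on such a digest would treat the two inputs as unrelated items, which is the concern raised for re-ripped or DRM-wrapped copies.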
althekiller Offline
Team-XBMC Developer
Posts: 4,710
Joined: May 2004
Reputation: 12
Post: #14
If someone seriously wants hashing, take a look at libofa and improve it to work well with movies. It generates an audio fingerprint, but is intended for music and only accepts 135 s of samples. This likely won't be enough to be accurate, due to the possibility of silent opening credits or the same intro song in a film.
ghatothkach Offline
Junior Member
Posts: 15
Joined: Nov 2008
Reputation: 0
Post: #15
hi

I think doing a hash search will be a waste. Instead, themoviedb could have a group of fields called ID-Cross-Reference, with field names like

imdb
ofdb
...
...

That way we can use only one scraper for TMDb, and missing info can be fetched from the other sources pointed to by TMDb.
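
The cross-reference idea above could look roughly like the sketch below. Everything in it is hypothetical: the `MovieRecord` type, the `cross_refs` field, and the placeholder IDs are invented for illustration and do not reflect TMDb's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MovieRecord:
    """Hypothetical TMDb-style record carrying an ID-Cross-Reference group."""
    tmdb_id: int
    title: str
    # Maps a site name (e.g. "imdb", "ofdb") to that site's ID for the movie.
    cross_refs: Dict[str, str] = field(default_factory=dict)


def external_id(record: MovieRecord, site: str) -> Optional[str]:
    """Return the ID this record has on another site, or None if unknown."""
    return record.cross_refs.get(site.lower())


# Placeholder values only; these are not real IDs.
record = MovieRecord(
    tmdb_id=1,
    title="Example Movie",
    cross_refs={"imdb": "tt0000000", "ofdb": "0000"},
)

print(external_id(record, "IMDb"))
print(external_id(record, "unknown-site"))
```

A scraper would then query TMDb once and follow `cross_refs` only for the fields that are still missing.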

G
mikaelf Offline
Junior Member
Posts: 1
Joined: Oct 2009
Reputation: 0
Post: #16
phash.org looks like a more reasonable approach.
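
To show why a perceptual approach differs from a bit-for-bit hash, here is a minimal sketch of an "average hash". This is a much simpler technique than what the pHash library implements and is not its algorithm; the tiny 2x2 grayscale grids are assumed stand-ins for heavily downscaled video frames.

```python
from typing import List


def average_hash(pixels: List[List[int]]) -> int:
    """Set one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


frame = [[10, 200], [220, 30]]        # stand-in for one frame
re_encoded = [[12, 198], [215, 33]]   # same picture, slightly different values
different = [[200, 10], [30, 220]]    # a different picture

print(hamming_distance(average_hash(frame), average_hash(re_encoded)))
print(hamming_distance(average_hash(frame), average_hash(different)))
```

Small pixel changes from re-encoding leave the hash unchanged here, while a different picture produces a distant hash, so matching is done by distance rather than by exact equality.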
[2ge] Offline
Admin of OpenSubtitles.org
Posts: 25
Joined: Nov 2006
Reputation: 0
Post: #17
phash.org seems really nice. Has anybody tried it in the real world? I don't know about the implementation, but it should work.

For the "opensubtitles hash", I suggest you store the IP address in the database as a hash (MD5(MD5(IP)), so the IP address cannot be restored); that way you will get rid of duplicate posts from the same user.
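
The duplicate-filtering suggestion above might be sketched as follows. Two assumptions are made here: the inner MD5 result is taken as a hex string before being hashed again (as a SQL-style `MD5(MD5(x))` would do), and submissions are tracked in a hypothetical in-memory set rather than a real database. The addresses used are from the documentation range 192.0.2.0/24.

```python
import hashlib
from typing import Set, Tuple


def anonymized_ip_key(ip: str) -> str:
    """Return MD5(MD5(ip)), hashing the inner hex digest as text (assumption)."""
    inner = hashlib.md5(ip.encode("ascii")).hexdigest()
    return hashlib.md5(inner.encode("ascii")).hexdigest()


# Hypothetical store of (hashed IP, movie hash) pairs already submitted.
seen: Set[Tuple[str, str]] = set()


def is_duplicate(ip: str, movie_hash: str) -> bool:
    """Record a submission and report whether this user already sent it."""
    key = (anonymized_ip_key(ip), movie_hash)
    if key in seen:
        return True
    seen.add(key)
    return False


print(is_duplicate("192.0.2.1", "0000000000020000"))
print(is_duplicate("192.0.2.1", "0000000000020000"))
print(is_duplicate("192.0.2.2", "0000000000020000"))
```

Only the derived key is stored, so repeated submissions from one address can be detected without keeping the address itself in the table.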

szsori Offline
TheTVDB.com Admin
Posts: 663
Joined: Aug 2006
Reputation: 1
Location: Milwaukee, WI
Post: #18
As a note, TheTVDB is planning on doing this during our rewrite as well. A few points to make about it:

1. It's completely optional, meaning that projects or individuals that don't want to make use of it don't have to.
2. It's 100% accurate. The only other method that can claim this is the NFO one, which requires that users hang on to the NFO files for their downloaded files.
3. Not all downloaded files are going to be copyright infringements. Shows downloaded from iTunes and similar will have the same hash and are completely legitimate.
4. It's fast and automatic. One of our admins ran a test with his users (he has an MCE plugin) and found the algorithm was able to process over 10 files per second (IIRC). It requires no user intervention, which is important.
5. It can't lead to copyright lawsuits. Remember, in almost every country (including the US) it's not illegal to possess a "scene" version of a show or movie. It's why people downloading from Usenet feel safe... unlike torrents, they're not uploading part of the file during their download. So, the information that someone has 500 scene movies doesn't really matter unless they're sharing them with others.

We'll also be implementing better cross-reference lookups, but those still depend on an NFO file. Until one of the sites is up and running using hashes and at least one of the big projects implements it, there's no way to tell how much it's being used. However, I will say that it solves a number of sticky issues on our end internally and should result in better overall results.

[2ge] Offline
Admin of OpenSubtitles.org
Posts: 25
Joined: Nov 2006
Reputation: 0
Post: #19
szsori - my words exactly. I started like this when I asked Gabest, the coder of the MPC player, if he could give me his database (of subtitles; I am not sure whether he had the IMDb number in there, I must make a script for that...). He said yes; there were around 3000 hashes. I checked his algorithm: it is far from the best, but it is pretty stable for now.

What I suggest is to create another, much better hashing method, while the "old" crc64 one (which I use, and now you, and later somebody else...) remains the default. If a client implements the "new" hashing, both hashes will be sent together. Later, we can switch to the new hashing method only. This needs a long discussion.
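
The migration described above could be sketched like this. The field names `legacy_hash` and `new_hash`, and the two lookup tables, are hypothetical and only illustrate the idea of sending both values during a transition period.

```python
from typing import Dict, Optional


def build_lookup_request(legacy_hash: str,
                         new_hash: Optional[str] = None) -> Dict[str, str]:
    """Always send the old hash; add the new one only if the client has it."""
    request = {"legacy_hash": legacy_hash}
    if new_hash is not None:
        request["new_hash"] = new_hash
    return request


def resolve(request: Dict[str, str],
            by_new: Dict[str, str],
            by_legacy: Dict[str, str]) -> Optional[str]:
    """Prefer a match on the new hash, then fall back to the old one."""
    new_hash = request.get("new_hash")
    if new_hash is not None and new_hash in by_new:
        return by_new[new_hash]
    return by_legacy.get(request["legacy_hash"])


# Placeholder tables mapping hashes to a movie identifier.
by_new = {"new-abc": "movie-1"}
by_legacy = {"old-123": "movie-1", "old-456": "movie-2"}

print(resolve(build_lookup_request("old-456"), by_new, by_legacy))
print(resolve(build_lookup_request("old-999", "new-abc"), by_new, by_legacy))
```

Older clients keep working unchanged, and the legacy table can be retired once most requests carry the new hash.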

Anyway, the important thing is to share these hashes between sites, so we can make much better services.

fekker Offline
Posting Freak
Posts: 1,545
Joined: Oct 2008
Reputation: 30
Post: #20
We've started adding this in to UMM as part of the searching options.

For subtitles it's the default method used (with opensubtitles.org) and seems to be very accurate

It'll be added into the tmdb options and later the tvdb options. The tvdb options, when available, will be a great way to automate moving episodes into the correct folder for the show, as the file names have way too many naming conventions (as well as a lack of folks following them). One note on those: please do return the proper show name (and/or id) and the season and episode numbers with those results.

- fekker
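
The folder automation mentioned in the post above depends on the lookup returning the show name, season, and episode. A minimal sketch of how a client might use such a result, assuming a hypothetical `Show/Season NN/Show - SNNENN.ext` layout (this is not UMM's actual convention):

```python
from pathlib import PurePosixPath


def episode_destination(library_root: str, show_name: str,
                        season: int, episode: int,
                        extension: str) -> PurePosixPath:
    """Build a target path from the fields a hash lookup is assumed to return."""
    season_dir = "Season %02d" % season
    file_name = "%s - S%02dE%02d%s" % (show_name, season, episode, extension)
    return PurePosixPath(library_root) / show_name / season_dir / file_name


# Placeholder lookup result for a badly named source file.
target = episode_destination("/tv", "Example Show", 1, 5, ".mkv")
print(target)
```

The original file name never has to be parsed, which is the appeal when naming conventions are inconsistent.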