TMDb to introduce movie hashing search
#1
This should speed up scraping a lot ;)

http://forums.themoviedb.org/topic/548/movie-hashes/

opensubtitles.org has been doing this for a while and it's been very successful.
#2
If this gets added, I vote for "disabled by default but optional". I personally wouldn't want data like that sent around the net associated with an IP address, although I can see how it has the potential to reduce server load for TMDb.
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
#3
While it might reduce the load on the TMDb server, there are a few things that might be of concern:

A digital fingerprint of sorts: IP address plus the hash of a specific version of a media file.

DVD backups won't match unless they were done with the same app in the same manner.

There are many, many different versions of movie files out there.

I'd have to agree: please make it off by default, with an option to remove or disable it.

The IMDb id is a unique ID that already serves a similar function without a specific link to the media file itself. There's also the TMDb id, which could be added to the information about the media and parsed just like the IMDb id.
#4
Just to re-assure people that usage of the API is anonymous and no IP addresses are stored as far as I know. Only the application API key is transferred.

The hash will only be stored after 4 submissions of the same key, so it should cover all the popular internet releases and one-click ripping solutions.
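
To make the threshold idea concrete, here is a rough Python sketch of how "store only after 4 submissions" could work. This is purely illustrative and not TMDb's actual code: the `HashRegistry` class, its method names, and the choice to treat the (hash, IMDb id) pair as the "key" are all assumptions.

```python
from collections import Counter

REQUIRED_SUBMISSIONS = 4  # the threshold mentioned in the post


class HashRegistry:
    """Hypothetical in-memory store; not TMDb's actual implementation."""

    def __init__(self):
        # (file_hash, imdb_id) -> number of submissions seen so far
        self._pending = Counter()
        # file_hash -> imdb_id, only for pairs that reached the threshold
        self._confirmed = {}

    def submit(self, file_hash, imdb_id):
        """Record one submission; confirm the pair once seen often enough."""
        key = (file_hash, imdb_id)
        self._pending[key] += 1
        if self._pending[key] >= REQUIRED_SUBMISSIONS:
            self._confirmed[file_hash] = imdb_id

    def lookup(self, file_hash):
        """Return the confirmed IMDb id for a hash, or None."""
        return self._confirmed.get(file_hash)
```

With something like this, a one-off or mistaken submission never becomes visible to lookups; only pairs that several users agree on do.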

The open subtitle directory has been using this method for a while and there have been no problems like you describe that I know of. TMDb is using the same algorithm for cross compatibility.
#5
I'm not doing this to help load; it's the holy grail of guaranteed title matching. By removing the fuzzy file name searches, we can greatly increase the chance of the movies being picked up properly.

Opensubtitles has been doing this for a while now, and yes, we are using their technology to make sure we're all doing it the same way. thetvdb.com has also said they'll be doing it as well. It's got nothing but benefits (I can't think of a single negative really) so it's a no brainer for us.

I've already been chatting with the XBMC guys, and it's on the table to be discussed. They have not given me a firm yes or no either way (yet), but it is interesting to hear all of your thoughts.

I don't store any personal information, so you don't need to worry about anything like that. It's completely anonymous. The only things uploaded are the hash of the file and the imdb_id.

Imagine not caring what or how your files are named, and XBMC, just like magic, knowing what titles they are. Doesn't that sound pretty awesome?
#6
I don't see the advantage here. You'll need a new hash for every version of every encoder and every container. Not to mention tags (which are a far better idea) completely ruin everything. Analyzing the raw encoded file data is useless. You need to devise a fingerprint based on the decoded content, similar to the musicbrainz ID. The musicbrainz ID may even be usable here w/o modification (assuming it's FLOSS), it will probably take ages to calculate though. You would still need a hash for every audio stream of every edition of the film, but surely that's far fewer permutations than with the current proposal.
#7
Yes, we already have that technology...

http://trac.opensubtitles.org/projects/o...ourceCodes

Opensubtitles has told me that, aside from regular old user error, they have yet to see a duplicate hash, and the Moving Pictures guys (the first big app we have on board) have told me the performance of generating the hash is more than reasonable.
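
For anyone curious what generating the hash involves, here is a minimal Python sketch of the OpenSubtitles-style hash as it is commonly described: the file size plus 64-bit checksums of the first and last 64 KiB of the file, written as 16 hex digits. The function name `movie_hash` and the error handling are my own choices for illustration; the linked source code is the reference.

```python
import os
import struct

CHUNK_SIZE = 64 * 1024  # 64 KiB read from each end of the file
MASK_64 = 0xFFFFFFFFFFFFFFFF  # keep the running sum within 64 bits


def movie_hash(path):
    """Return a 16-character hex hash: file size plus the sums of the
    little-endian 64-bit words in the first and last 64 KiB."""
    size = os.path.getsize(path)
    if size < CHUNK_SIZE * 2:
        raise ValueError("file too small to hash this way")

    total = size
    with open(path, "rb") as f:
        for offset in (0, size - CHUNK_SIZE):
            f.seek(offset)
            chunk = f.read(CHUNK_SIZE)
            # Add each 8-byte little-endian word, wrapping at 64 bits.
            for (word,) in struct.iter_unpack("<Q", chunk):
                total = (total + word) & MASK_64
    return "%016x" % total
```

Because only the size and the two 64 KiB ends are read, the cost stays roughly constant no matter how large the movie file is.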

We won't get every version, no, but we'll realistically be able to cover the vast majority of releases. I am not condoning or supporting anything here, but we all know most people get their content via the same places so having coverage for ~80%+ of the popular movies out there would be easy.
#8
I'd like to point out that the only people who really benefit from this are people who download the same movie as everyone else, much like opensubtitles.org uses scene release names for the subtitles. This has little to no benefit for people who actually make copies of the DVDs they own because, as everyone else said, depending on which program you use to rip the file, which codec you use and so on, the hash will always be different.
#9
The fingerprinting I was referencing was actually at the network level.

Any chance of SSL access to the API? That would remove that concern.
(i.e. a packet sent from IP XX.XX.XX.XX with a request for a movie by hash id YYYYY, where the hash is generated from local file information)

I do like the idea of it not mattering what the name or location of a file is :D The musicbrainz stuff is wicked cool tech.
#10
Jezz_X Wrote: I'd like to point out that the only people who really benefit from this are people who download the same movie as everyone else

So most people then ;)

Personally can't wait for this feature, it is indeed the holy grail to just let the scraper do its stuff without worrying about file structure, nfos or directory structure. Particularly looking forward to thetvdb implementing it as well.

It won't help everyone, you're right, but it will be a massive improvement.
#11
We're still discussing this internally, so note that this is my opinion only.

I can see why this idea may be useful for the subtitles people, as the subs people download are designed for a particular encode of the movie, and may not work with other encodes of that same movie (timing issues).

However, I see no real benefit in the case of looking up metadata in terms of improving efficacy:

1. It's fairly clear that the primary beneficiaries are those who obtain "scene" releases, which brings implications of copyright infringement, and in turn possible privacy implications.
2. All such releases already have .nfo files with the download (else the "scene" rejects them), which I believe already have the IMDb id in them, thus from XBMC's perspective, the gains appear minimal.
3. Clearly the 'hash' is computed based on the encoding, not based on content, so it's not particularly useful for identifying an item anyway (multiple hashes for each movie). The IMDb id (or TMDb id for that matter) appears better for this purpose, as it is unique to the movie, not the encoding.
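
To illustrate point 2, here is a rough Python sketch of pulling an IMDb id out of the text of a .nfo file. It is only a sketch and not XBMC's actual scraper code: the function name `find_imdb_id` is made up, and the pattern assumes ids look like "tt" followed by at least seven digits, as in an imdb.com/title/ URL.

```python
import re

# Assumption: IMDb title ids are "tt" followed by seven or more digits,
# e.g. the "tt0133093" in http://www.imdb.com/title/tt0133093/
IMDB_ID_RE = re.compile(r"\btt\d{7,}\b")


def find_imdb_id(nfo_text):
    """Return the first IMDb id found in the .nfo text, or None."""
    match = IMDB_ID_RE.search(nfo_text)
    return match.group(0) if match else None
```

A lookup by that id identifies the movie itself, regardless of which encode the .nfo shipped with.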

Cheers,
Jonathan
#12
jmarshall Wrote: 1. It's fairly clear that the primary beneficiaries are those who obtain "scene" releases, which brings implications of copyright infringement, and in turn possible privacy implications.
Well, the hash of a film is not copyrighted, so I don't see the problem there. Also, as legitimate movie downloads take off, this feature could help with those as well. It already helps with full ISO rips of DVDs you own, which should have the same hash. Privacy is a fair point, which is why the transfer should always remain anonymous.

jmarshall Wrote: 2. All such releases already have .nfo files with the download (else the "scene" rejects them), which I believe already have the IMDb id in them, thus from XBMC's perspective, the gains appear minimal.
Very true, but it's another file that would not be needed if this hashing method does take off. Again, this method won't suit everyone, so there will always need to be alternatives.

jmarshall Wrote: 3. Clearly the 'hash' is computed based on the encoding, not based on content, so it's not particularly useful for identifying an item anyway (multiple hashes for each movie). The IMDb id (or TMDb id for that matter) appears better for this purpose, as it is unique to the movie, not the encoding.
I think as the multiple hashes build up over time you will be surprised how many movies are recognized. It's the power of the community that will make or break this new feature, I guess.
#13
These hashes are "bit-for-bit", which means that even the slightest difference between two files will result in a different hash. This would include "legit" downloads, as each purchase would result in a different DRM-wrapped file. In my opinion the hashing is absolutely useless in all cases except for torrent downloaders. I personally rip all my own DVDs to ISOs, stripping out all the extra stuff (menus, special features, etc.) and leaving just the main title. Hashing those would be virtually useless, as others would have to do the exact same thing. And even then, I'm not so sure it would generate the exact same file. I ripped a DVD a long time ago and then re-ripped it just the other day the exact same way using the same app, and the hashes were different.

- Josh
#14
If someone seriously wants hashing, take a look at libofa and improve it to work well with movies. It generates an audio fingerprint, but is intended for music and only accepts 135s of samples. That likely won't be enough to be accurate, due to the possibility of silent opening credits or the same intro song in a film.
#15
hi

I think doing a hash search will be a waste. Instead, themoviedb could have a group of fields called ID-Cross-Reference, with field names like

imdb
ofdb
...
...

That way we can use only one scraper for TMDb, and missing info can be fetched from other sources pointed to by TMDb.

G