script.module.metautils dev

Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #16
k_zeon Wrote:ahh thanks.

Does this mean that I would need to scrape an IMDB number for each movie the first time round?
The new TVShack has an IMDB number once you click into the page, so what I would need to do is:
1. Get the webpage and scrape the IMDB number.
2. Use metautils to get the movie information.
3. Then call add_directory and place all info as you mentioned for each movie.

If the movie is found it would scrape the info. Next time, if the data is already there, would it skip scraping the info?

Pretty much. I am in the same boat as you, and really it's not ideal to have to scrape so many pages every time you load a video list. I would suggest not going that route, as I imagine the site owners would not be too happy.

One of the things I would like to figure out is scraping based on movie name. I haven't looked at doing any enhancements like that yet.
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #17
Small update

I've added the ability to search TMDB by movie name + year; the year is optional.

If nothing is found on TMDB it tries IMDB with the same query and populates as much info as it can.

The key is to pass in clean movie names: strip everything that does not belong to the name itself. The year helps in returning correct matches.

The call is now - meta = metaget.get_meta(imdb,type,name, year)
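
As a rough illustration of what "clean" could mean here, the sketch below splits a scraped label into a name and an optional year before either is handed to get_meta. The helper name split_title_year and the label layout ("Title (2001) [tag]") are assumptions for the example, not part of the module.

```python
import re

def split_title_year(raw):
    """Split a scraped label such as 'Some Movie (2001) [HD]' into (name, year).

    Hypothetical helper: the label layout is an assumption about the site.
    """
    match = re.search(r'\((\d{4})\)', raw)
    year = match.group(1) if match else ''
    # Keep only the text before the bracketed year, if there is one.
    name = raw[:match.start()] if match else raw
    # Drop square-bracket tags and collapse runs of whitespace.
    name = re.sub(r'\[[^\]]*\]', '', name)
    name = re.sub(r'\s+', ' ', name).strip()
    return name, year
```

The two values would then go in as the name and year arguments, with an empty string for the IMDB id when none is known.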

I will write some docs up as this gets closer to release


Side q, I'm having some issues retrieving a specific URL:
Code:
http://www.imdbapi.com/?t=Bridget Joness Diary&y=2001

I get a 400 Bad Request error with Python, yet it opens fine in browsers and I can't see anything odd in the headers, even with urlencoding to handle the spaces.
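
For reference, a minimal sketch of building that query string with the standard library, so the spaces never reach the request line raw. The function name build_query_url is made up for the example; the t and y parameters are the ones from the URL above. This only shows the encoding step and is not a claim about what caused the 400.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

def build_query_url(base, title, year=None):
    """Return base + '?' + an encoded query string; spaces become '+'."""
    params = [('t', title)]
    if year:
        params.append(('y', str(year)))
    return base + '?' + urlencode(params)

print(build_query_url('http://www.imdbapi.com/', 'Bridget Joness Diary', 2001))
```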
k_zeon Offline
Senior Member
Posts: 217
Joined: Aug 2011
Reputation: 0
Post: #18
Just tried the following in IDLE and got info back:

import re
import urllib2

# Spaces in the title are percent-encoded as %20
url = 'http://www.imdbapi.com/?t=Bridget%20Joness%20Diary&y=2001'

req = urllib2.Request(url)
# Send a browser-like User-Agent with the request
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')

response = urllib2.urlopen(req)
link = response.read()
response.close()

# Grab whatever sits between the braces of the reply
match = re.compile('{(.+?)}').findall(link)

print match
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #19
Very odd.. now it works fine for me today!

k_zeon, have you tested metautils with your addon yet?

I've been testing mainly with my Project Free TV addon, with pretty good results. I don't have IMDB ids, so all the testing I'm doing is by movie name + year.

I need to start looking into the TV show scraping next and ensure it's working correctly, then a LOT more code cleanup!

Also, I've noticed that for some reason they lumped the cast into the same data as the plot/overview. Not sure of the reasoning for this, but I would like to split it back out.
k_zeon Offline
Senior Member
Posts: 217
Joined: Aug 2011
Reputation: 0
Post: #20
Not had a chance yet, family's had a cold all week.
Would like to have a go sometime this week

so basically i would use:

meta = metaget.get_meta(imdb,type,name, year)

add_video_item({'url': url},{meta},img=thumb)

And would I have to get the picture URL as well and then pass it to img?
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #21
Currently you can't simply pass the 'meta' that is returned into your add_video_item. For some reason the original programmer didn't give the keys the proper names that xbmc expects, which is something I should fix. Also, I'm not sure how xbmc will react to all the extra values being returned which are not valid infolabels.

For now you need to parse out what is returned into your own dict.

eg.

infoLabels = {}
infoLabels['genre'] = meta['genres']
etc..

add_video_item({'url': url},infoLabels,img=meta['cover'], fanart=meta['fanart'])
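
The "etc.." above could be filled in with a small key map, roughly as sketched below. The function name to_infolabels and the map itself are assumptions for illustration; only the 'genres' and 'plot' keys are ones that appear in this thread, so the map would need extending after checking what the returned dict really contains.

```python
# Metahandler key -> xbmc infolabel name.
# Only 'genres' and 'plot' are keys seen in this thread; the rest of the
# returned dict is deliberately ignored so no unknown labels reach xbmc.
KEY_MAP = {'genres': 'genre', 'plot': 'plot'}

def to_infolabels(meta, title):
    """Build an infolabels dict from a metadata dict (hypothetical helper)."""
    labels = {'title': title}
    for meta_key, label in KEY_MAP.items():
        value = meta.get(meta_key)
        if value:  # skip missing or empty values
            labels[label] = value
    return labels
```

Keeping the image keys ('cover', 'fanart') out of the map matches the call above, where they are passed separately as img and fanart.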
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #22
Many more updates:

- Renamed project to metahandler as metautils was interfering with Icefilms :)

- Scraping of movies by Name and Year (optional) is fully working and so far seems quite accurate; scraping of movies by IMDB id is always preferable and faster

- TV Show Season scraping is working by either IMDB or TV Show name

eg.
Code:
metapath = xbmc.translatePath('special://profile/addon_data/script.module.metahandler/meta_cache')
metaget = metahandlers.MetaData(metapath, preparezip=True)
meta = metaget.get_meta('', 'tvshow', 'The Simpsons')
print meta

Seasons and Episodes I believe will work fine as well, but I haven't dug into that portion of the code yet to see what needs to be cleaned up

Once you have the TV Show scraped you *should* have the IMDB id

Season:
Code:
season_list = ['Season 1', 'Season 2']
seasons = metaget.getSeasonCover(imdb_id, season_list, refresh=False)
print seasons

Episode:
Code:
episode = metaget.get_episode_meta(imdb_id, season_num, episode_num)
print episode

I'm hoping others can test this, mainly to check its accuracy and spot areas where it can be improved. This is mainly about scraping by Movie/TV Show name only, with no IMDB id.

Question:
Currently the cache DB stores a number of items including the movie/tv show name. As sites could potentially leave out special characters (or even spaces) and vary the capitalization when spelling out the name, searching by name is difficult.

So for my string compares I have resorted to putting all strings to lower case and stripping all non-alphanumerics using isalnum().

Looking for comments on whether I should also store the movie/tv show names with everything stripped and lower cased? The name is used initially to check whether the title exists in the cache before it tries to scrape, and is checked again before it tries to add a new entry to the DB.
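
A minimal sketch of the kind of compare key described here, assuming nothing about how the module does it internally. The name normalize_name is made up for the example.

```python
def normalize_name(name):
    """Lower-case a title and keep only its alphanumeric characters."""
    return ''.join(ch for ch in name.lower() if ch.isalnum())

# Two spellings of the same title collapse to one key.
print(normalize_name("Bridget Jones's Diary") == normalize_name("bridget joness diary"))
```

The flip side is that titles differing only in punctuation or spacing also collapse to the same key, which matters if the stripped form is stored and used for lookups.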
(This post was last modified: 2011-09-30 01:40 by Eldorado.)
t0mm0 Offline
Fan
Posts: 486
Joined: Mar 2011
Reputation: 8
Location: UK
Post: #23
Eldorado Wrote:Many more updates:
looks like this is going really well eldorado, sorry i have not helped out much but i am really busy at the moment :(
Eldorado Wrote:I'm hoping others can test this and mainly spot it's accuracy and areas where it can be improved, this is mainly dealing with scraping by only Movie/TV Show name - no IMDB
i'll try adding support in letmewatchthis when i get a chance and see how it goes!
Eldorado Wrote:Question:
Currently the cache DB stores a number of items including the movie/tv show name.. as sites could potentially leave out special characters (or even spaces), and capitalization when spelling out the name it causes the search by name difficult

So for my string compares I have resorted to putting all strings to lower case and stripping all non alphanumerics using isalnum()

Looking for comments on whether I should also store the movie/tv show names with everything stripped and lower cased? The name is used to initially check if it exists in cache before it tries to scrape, and checks again before it tries to add a new entry to the DB

if you need to query the db using the stripped form then i would say it is probably worth putting it in the database.

i guess you need to think carefully about whether you end up with duplicate names once you do the stripping and how that is dealt with? i presume there is just no way of telling them apart anyway. maybe try exact matches first?

maybe normalizing unicode to ascii would be better (like i do for logging) than just stripping stuff (guess that wouldn't help for spaces in different places though)
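
one way the unicode-to-ascii idea could look, as a sketch only (the helper name to_ascii is made up, and this is not necessarily how the logging code does it): decompose accented characters, then drop whatever is still not ascii.

```python
import unicodedata

def to_ascii(text):
    """Fold accented characters to their closest ASCII letters.

    NFKD splits a character like 'e with acute' into 'e' plus a combining
    accent; encoding with 'ignore' then drops the accent and anything else
    that has no ASCII form.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'Am\xe9lie'))
```

characters with no decomposition are simply dropped, so this still loses information for some titles.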

t0mm0.
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #24
Sweet, I'm assuming you don't have IMDB ids with letmewatchthis? The scraping of movies/tv shows by just the name is what I'm interested in; scraping by IMDB id is pretty straightforward

When searching by name:

- for a movie it will take the first match
- for a tvshow it will scan the results and look for an exact title match
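
The two rules above can be sketched roughly like this. The result shape (a list of dicts with a 'title' key) and the name pick_result are assumptions for the example, not the module's actual code, and the case-insensitive compare is one possible reading of "exact".

```python
def pick_result(results, wanted, media_type):
    """Pick one search result: first hit for movies, exact title for tv shows."""
    if not results:
        return None
    if media_type == 'movie':
        return results[0]
    wanted_key = wanted.strip().lower()
    for result in results:
        if result['title'].strip().lower() == wanted_key:
            return result
    return None  # no exact title match among the tv show results
```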

I'm thinking maybe I should save the movie/tv show name all stripped down

Filtering with .isalnum() is kinda handy as it lets me drop everything but alphanumerics

I do have some issues with unicode stuff right now too: some plots being returned have characters with accents etc. in them that it just doesn't like. There seems to be some hacky type code in right now to try and correct it, so just a heads up in case you get errors of the sort :)
t0mm0 Offline
Fan
Posts: 486
Joined: Mar 2011
Reputation: 8
Location: UK
Post: #25
Eldorado Wrote:Sweet, I'm assuming you don't have imdb id's with letmewatchthis? The scraping of movies/tv shows by just the name is what I'm interested in, by IMDB is pretty straight forward
yeah it does show imdb numbers once you get to the sources page, but that doesn't get fetched until you press play so is useless unless you fetch the sources page for every movie in the list which seems like a bad idea! i think that is the case on a lot of sites - we need a campaign to get sites to include the imdb code on their search results/browse pages ;)
Eldorado Wrote:When searching by name:

- a movie it will take the first match
- tvshow will scan the results and look for an exact title match

I'm thinking maybe I should save the movie/tv show name all stripped down

Using .isalnum() is kinda handy as it removes everything but alphanumerics
makes sense, i'll see how it works for me and let you know
Eldorado Wrote:I do have some issues with unicode stuff right now too, some plot's being returned having characters in with accents etc. that it just doesn't like.. seems to be some hacky type code in right now to try and correct it, so just a heads up in case you get errors of the sort :)

are you using t0mm0.common.net? that should return properly encoded unicode so you don't have to worry (unless the webpage lies about the character set in use as we discovered before, i still need to commit the fix/workaround for that). if it doesn't then let me know and it'll be a useful test case...

t0mm0
k_zeon Offline
Senior Member
Posts: 217
Joined: Aug 2011
Reputation: 0
Post: #26
Hi Eldorado

Just about to have a go with your module and wanted to check something.
When you renamed the project to metahandler, I see that the Addon.xml file still has <addon id="script.module.metautils" name="metautils" version="0.0.1"

Is this supposed to be changed as well?

tks
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #27
Hmmm.. probably should! :)

I have a couple more minor updates to commit, I'll put them up today

edit - just committed my latest changes and updated my first post with some quick how to's
(This post was last modified: 2011-10-02 17:56 by Eldorado.)
k_zeon Offline
Senior Member
Posts: 217
Joined: Aug 2011
Reputation: 0
Post: #28
Hi Eldorado

Just tried for the last 3 hours but have not been able to get it to work.
I pass the movie name and the year (after getting rid of the brackets).
I have tried adding infoLabels at different locations but it errors out.

I did have a bit of success when adding img=infoLabels['thumb'] at the end,
and it looked like it was getting info, but then after so many items it would error out,
i.e. at item 35 every time.

As you can see below, I am not even adding anything to add_directory, as I am trying one step at a time, but it still errors out.

The URL I am passing is http://tvsearch.co/movies/A/ and it should get all movies starting with A.

Any thoughts?

Code:
html = net.http_GET(url).content
match = re.compile('<li class="searchList"><a href="(.+?)">(.+?)</a> <span>(.+?)</span>').findall(html)

metaget = metahandlers.MetaData()

for url, name, sYear in match:
    # Strip the brackets from the year and the colon from the name
    sYear = sYear.replace('(', '').replace(')', '')
    name = name.replace(':', '')

    # Look up by name + year, as no IMDB id is available
    meta = metaget.get_meta('', 'movie', name, sYear)
    infoLabels = create_infolabels(meta, name)
    addon.add_directory({'mode': 'GetMovieSource', 'url': url}, name, total_items=len(match))
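
One way to narrow down which item breaks the list, sketched under assumptions: wrap each per-item step in a small guard so a single failure is logged and skipped instead of aborting the whole directory. The helper name safe_call is made up for the example and is not part of metahandler or the common library.

```python
def safe_call(func, *args):
    """Call func(*args); on any error, report it and return an empty dict."""
    try:
        return func(*args)
    except Exception as exc:
        # Broad on purpose: this is a debugging aid, not production handling.
        print('lookup failed for %r: %s' % (args, exc))
        return {}
```

Inside the loop the metadata steps would then be called through the guard, for example safe_call(create_infolabels, meta, name), so the remaining items still get added while the failing title shows up in the log.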
Eldorado Offline
Fan
Posts: 520
Joined: May 2009
Reputation: 14
Post: #29
As of now you can't add metadata to directories using t0mm0's common library, though he might be working on adding that.

You can only add it to video items, using his add_video_item() method.

Are you getting errors when retrieving metadata? I would like to see any errors, so can you post a log? I need to add more logging as well.

To see a working example, check out the 'meta' branch of Project Free TV in my repository: https://github.com/Eldorados/eldorado-xb.../tree/meta


Also, you can use SQLiteSpy to take a look inside the cache database, which is created at .../userdata/addon_data/script.module.metahandler/meta_cache/video_cache.db

Do a select on the table 'movie_meta' and you can see what has been found. A good result will have the majority of the fields filled, as well as the imdb_id and tmdb_id.
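
The same check can be scripted instead of using SQLiteSpy. This is a sketch only: the table name movie_meta and the imdb_id column come from this thread, while the helper names and the "filled fields" measure are assumptions for the example.

```python
import sqlite3

def fetch_movie_rows(conn, imdb_id=None):
    """Return movie_meta rows as dicts, optionally filtered by imdb_id."""
    conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
    if imdb_id:
        cur = conn.execute('SELECT * FROM movie_meta WHERE imdb_id = ?', (imdb_id,))
    else:
        cur = conn.execute('SELECT * FROM movie_meta')
    return [dict(row) for row in cur.fetchall()]

def filled_ratio(row):
    """Fraction of columns in a row that hold a non-empty value."""
    values = list(row.values())
    return sum(1 for v in values if v not in (None, '')) / float(len(values))
```

To use it, open the cache with sqlite3.connect() pointed at the video_cache.db file and pass the connection in; rows with a low ratio are the ones where the name lookup probably missed.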
(This post was last modified: 2011-10-04 03:39 by Eldorado.)
k_zeon Offline
Senior Member
Posts: 217
Joined: Aug 2011
Reputation: 0
Post: #30
I just checked the DB and yes, the fields have been filled with info.

I still don't understand why it stops.

When scraping the A's it gets to 35 every time.
When scraping the Numbers it always stops at 3, i.e. the one after 1. Mai.

I think it is to do with UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 32: ordinal not in range(128)
If I comment out the metahandler bits it loads the menus fine.


Code:
21:52:57 T:118956032 M:100352000  NOTICE: SQL SELECT
21:52:57 T:118956032 M:100352000  NOTICE: SELECT * FROM movie_meta WHERE imdb_id = 'tt1489877'
21:52:57 T:118956032 M:100352000  NOTICE: MATCHED ROW
21:52:57 T:118956032 M:100352000  NOTICE: None
21:52:57 T:118956032 M:100352000   DEBUG: TVSource: adding dir: 1 a Minute  (2010) - plugin://plugin.video.tvsource/?url=http%3A%2F%2Ftvsearch.co%2Fmovies%2Fwatch%2FP8PXD8MV&mode=GetMovieSource
21:52:57 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.search/en/json/b91e899ce561dd19695340c3b26e0a02/1 Day
21:52:59 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.getInfo/en/json/b91e899ce561dd19695340c3b26e0a02/30262
21:53:01 T:118956032 M:100352000  NOTICE: SQL SELECT
21:53:01 T:118956032 M:100352000  NOTICE: SELECT * FROM movie_meta WHERE imdb_id = 'tt1268158'
21:53:01 T:118956032 M:100352000  NOTICE: MATCHED ROW
21:53:01 T:118956032 M:100352000  NOTICE: None
21:53:01 T:118956032 M:100352000   DEBUG: TVSource: adding dir: 1 Day  (2009) - plugin://plugin.video.tvsource/?url=http%3A%2F%2Ftvsearch.co%2Fmovies%2Fwatch%2FR7T7SXRZ&mode=GetMovieSource
21:53:01 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.search/en/json/b91e899ce561dd19695340c3b26e0a02/1. Mai
21:53:02 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.getInfo/en/json/b91e899ce561dd19695340c3b26e0a02/8209
21:53:03 T:118956032 M:100352000  NOTICE: SQL SELECT
21:53:03 T:118956032 M:100352000  NOTICE: SELECT * FROM movie_meta WHERE imdb_id = 'tt1020932'
21:53:03 T:118956032 M:100352000  NOTICE: MATCHED ROW
21:53:03 T:118956032 M:100352000  NOTICE: None
21:53:03 T:118956032 M:100352000    INFO: -->Python script returned the following error<--
21:53:03 T:118956032 M:100352000   ERROR: Error Type: <type 'exceptions.UnicodeEncodeError'>
21:53:03 T:118956032 M:100352000   ERROR: Error Contents: 'ascii' codec can't encode character u'\xd6' in position 32: ordinal not in range(128)
21:53:03 T:118956032 M:100352000   ERROR: Traceback (most recent call last):
                                              File "/var/mobile/Library/Preferences/XBMC/addons/plugin.video.tvsource/default.py", line 207, in <module>
                                                infoLabels = create_infolabels(meta, name)
                                              File "/var/mobile/Library/Preferences/XBMC/addons/plugin.video.tvsource/default.py", line 34, in create_infolabels
                                                infoLabels['plot'] = str(meta['plot'])
                                            UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 32: ordinal not in range(128)
21:53:03 T:118956032 M:100352000    INFO: -->End of Python script error report<--
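
The traceback points at str(meta['plot']): under Python 2, calling str() on a unicode value encodes it with the ascii codec, which fails on a character such as u'\xd6'. A possible workaround, sketched here with a made-up helper name (to_utf8), is to encode to UTF-8 explicitly instead of calling str().

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3

def to_utf8(value):
    """Return UTF-8 bytes for text; pass byte strings through untouched."""
    if isinstance(value, bytes):
        return value
    return text_type(value).encode('utf-8')
```

In create_infolabels that would mean something like infoLabels['plot'] = to_utf8(meta['plot']) in place of the str() call. Whether the rest of the addon expects UTF-8 byte strings or unicode objects is an assumption that would need checking.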

tks
(This post was last modified: 2011-10-04 09:22 by k_zeon.)