script.module.metautils dev

Eldorado Offline
Fan
Posts: 497
Joined: May 2009
Reputation: 11
Post: #21
k_zeon Wrote:Not had a chance yet, family's had a cold all week.
Would like to have a go sometime this week

so basically i would use:

meta = metaget.get_meta(imdb,type,name, year)

add_video_item({'url': url},{meta},img=thumb)

and would i have to get the Pic url as well then pass to img

Currently you can't simply pass the 'meta' dict that is returned straight into your add_video_item. For some reason the original programmer didn't give the keys the names that xbmc expects, which is something I should fix. I'm also not sure how xbmc will react to all the extra values being returned which are not valid infolabels.

For now you need to parse out what is returned into your own dict

eg.

infoLabels = {}
infoLabels['genre'] = meta['genres']
etc..

add_video_item({'url': url},infoLabels,img=meta['cover'], fanart=meta['fanart'])
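
A minimal sketch of that parsing step, assuming the metadata comes back as a plain dict. Only the 'genres' to 'genre' pair and the 'cover'/'fanart' keys appear in this post; build_infolabels, KEY_MAP and the other key names are hypothetical and only there for illustration.

```python
def build_infolabels(meta, key_map):
    """Copy only the mapped keys, so unknown values never reach xbmc."""
    info_labels = {}
    for meta_key, label in key_map.items():
        value = meta.get(meta_key)
        # skip missing or empty values instead of passing them on
        if value not in (None, ''):
            info_labels[label] = value
    return info_labels


# 'genres' -> 'genre' is from the post above; the other pairs are assumptions
KEY_MAP = {'genres': 'genre', 'plot': 'plot', 'title': 'title'}

# made-up metadata standing in for what get_meta() returns
meta = {
    'genres': 'Comedy',
    'plot': 'A short plot.',
    'title': '',
    'cover': 'http://example.invalid/cover.jpg',
    'fanart': 'http://example.invalid/fanart.jpg',
}

info_labels = build_infolabels(meta, KEY_MAP)
```

Keys that are not in the map (such as 'cover' and 'fanart') are left out of the infolabels and can still be passed separately as img and fanart.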
Eldorado Offline
Fan
Posts: 497
Joined: May 2009
Reputation: 11
Post: #22
Many more updates:

- Renamed project to metahandler as metautils was interfering with Icefilms :)

- Scraping of movies by Name and Year (optional) is fully working and so far seems quite accurate; scraping of movies by IMDB id is always preferable and faster

- TV Show Season scraping is working by either IMDB or TV Show name

eg.
Code:
metapath = xbmc.translatePath('special://profile/addon_data/script.module.metahandler/meta_cache')
metaget = metahandlers.MetaData(metapath, preparezip=True)
# empty IMDB id, so the show is looked up by name
meta = metaget.get_meta('', 'tvshow', 'The Simpsons')
print meta

Seasons and Episodes I believe will work fine as well, but I haven't dug into that portion of the code yet to see what needs to be cleaned up

Once you have the TV Show scraped you *should* have the IMDB id

Season:
Code:
season_list = ['Season 1', 'Season 2']
seasons = metaget.getSeasonCover(imdb_id, season_list, refresh=False)
print seasons

Episode:
Code:
episode = metaget.get_episode_meta(imdb_id, season_num, episode_num)
print episode

I'm hoping others can test this, mainly to check its accuracy and spot areas where it can be improved. This mostly concerns scraping by Movie/TV Show name only, with no IMDB id

Question:
Currently the cache DB stores a number of items, including the movie/tv show name. Sites can leave out special characters (or even spaces) and vary the capitalization when spelling out a name, which makes searching by name difficult

So for my string compares I have resorted to lower-casing all strings and stripping all non-alphanumerics using isalnum()

Looking for comments on whether I should also store the movie/tv show names with everything stripped and lower cased. The name is used to check whether an entry exists in the cache before it tries to scrape, and again before it tries to add a new entry to the DB
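
A rough sketch of the stripping described above, in case it helps the discussion. The function name normalize_name is hypothetical and this is not necessarily how metahandler does it internally; the sample titles are ones mentioned in this thread.

```python
def normalize_name(name):
    """Lower-case the name and keep only the alphanumeric characters."""
    return ''.join(ch for ch in name.lower() if ch.isalnum())


# different spellings of the same title end up with the same key
simpsons_a = normalize_name('The Simpsons')
simpsons_b = normalize_name('the  simpsons!')
mai = normalize_name('1. Mai')
```

Storing this key in its own column would let the cache lookup compare like with like, at the cost of two different titles occasionally collapsing to the same key.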
(This post was last modified: 2011-09-30 01:40 by Eldorado.)
t0mm0 Offline
Fan
Posts: 521
Joined: Mar 2011
Reputation: 8
Location: UK
Post: #23
Eldorado Wrote:Many more updates:
looks like this is going really well eldorado, sorry i have not helped out much but i am really busy at the moment :(
Eldorado Wrote:I'm hoping others can test this and mainly spot its accuracy and areas where it can be improved, this is mainly dealing with scraping by only Movie/TV Show name - no IMDB
i'll try adding support in letmewatchthis when i get a chance and see how it goes!
Eldorado Wrote:Question:
Currently the cache DB stores a number of items including the movie/tv show name.. as sites could potentially leave out special characters (or even spaces), and capitalization when spelling out the name it causes the search by name difficult

So for my string compares I have resorted to putting all strings to lower case and stripping all non alphanumerics using isalnum()

Looking for comments on whether I should also store the movie/tv show names with everything stripped and lower cased? The name is used to initially check if it exists in cache before it tries to scrape, and checks again before it tries to add a new entry to the DB

if you need to query the db using the stripped form then i would say it is probably worth putting it in the database.

i guess you need to think carefully about whether you end up with duplicate names once you do the stripping and how that is dealt with? i presume there is just no way of telling them apart anyway. maybe try exact matches first?

maybe normalizing unicode to ascii would be better (like i do for logging) than just stripping stuff (guess that wouldn't help for spaces in different places though)
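
The two ideas above (exact match first, and folding unicode to ascii before stripping) could be combined roughly like this. It is only a sketch: to_ascii, lookup_key and find_in_cache are hypothetical names, and the cache is assumed to be a plain list of names rather than the real DB.

```python
import unicodedata


def to_ascii(text):
    """Fold accented characters to their closest ASCII letters."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def lookup_key(text):
    """ASCII-folded, lower-cased, alphanumeric-only form of a name."""
    return ''.join(ch for ch in to_ascii(text).lower() if ch.isalnum())


def find_in_cache(name, cached_names):
    """Try an exact match first, then fall back to the stripped key."""
    if name in cached_names:
        return [name]
    key = lookup_key(name)
    # may return more than one name if several strip to the same key
    return [cached for cached in cached_names if lookup_key(cached) == key]


cached = [u'Am\xe9lie', u'Up!', u'Up']
exact = find_in_cache(u'Up', cached)
folded = find_in_cache(u'amelie', cached)
ambiguous = find_in_cache(u'up', cached)
```

When the fallback returns more than one name, the caller still has to decide how to break the tie (for example by year).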

t0mm0.
Eldorado Offline
Fan
Posts: 497
Joined: May 2009
Reputation: 11
Post: #24
t0mm0 Wrote:looks like this is going really well eldorado, sorry i have not helped out much but i am really busy at the moment :( i'll try adding support in letmewatchthis when i get a chance and see how it goes!

if you need to query the db using the stripped form then i would say it is probably worth putting it in the database.

i guess you need to think carefully about whether you end up with duplicate names once you do the stripping and how that is dealt with? i presume there is just no way of telling them apart anyway. maybe try exact matches first?

maybe normalizing unicode to ascii would be better (like i do for logging) than just stripping stuff (guess that wouldn't help for spaces in different places though)

t0mm0.

Sweet, I'm assuming you don't have IMDB ids with letmewatchthis? Scraping movies/tv shows by just the name is what I'm interested in; scraping by IMDB id is pretty straightforward

When searching by name:

- for a movie it will take the first match
- for a tvshow it will scan the results and look for an exact title match
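
As a sketch of those two rules, assuming the search results arrive as a list of dicts with a 'title' key (that shape, the pick_match name, and the case-insensitive comparison are all assumptions, not the actual metahandler code):

```python
def pick_match(media_type, title, results):
    """Movies take the first hit; tv shows need an exact title match."""
    if not results:
        return None
    if media_type == 'movie':
        return results[0]
    wanted = title.strip().lower()
    for result in results:
        if result['title'].strip().lower() == wanted:
            return result
    return None


# made-up search results for illustration
results = [{'title': 'The Simpsons Movie'}, {'title': 'The Simpsons'}]
movie_hit = pick_match('movie', 'The Simpsons', results)
show_hit = pick_match('tvshow', 'The Simpsons', results)
no_hit = pick_match('tvshow', 'Futurama', results)
```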

I'm thinking maybe I should save the movie/tv show name all stripped down

Filtering with .isalnum() is kinda handy as it lets you drop everything but alphanumerics

I do have some issues with unicode stuff right now too: some plots being returned contain accented characters etc. that it just doesn't like. There seems to be some hacky code in there right now to try and correct it, so just a heads up in case you get errors of that sort :)
t0mm0 Offline
Fan
Posts: 521
Joined: Mar 2011
Reputation: 8
Location: UK
Post: #25
Eldorado Wrote:Sweet, I'm assuming you don't have IMDB ids with letmewatchthis? Scraping movies/tv shows by just the name is what I'm interested in; scraping by IMDB id is pretty straightforward
yeah it does show imdb numbers once you get to the sources page, but that doesn't get fetched until you press play so is useless unless you fetch the sources page for every movie in the list which seems like a bad idea! i think that is the case on a lot of sites - we need a campaign to get sites to include the imdb code on their search results/browse pages ;)
Eldorado Wrote:When searching by name:

- a movie it will take the first match
- tvshow will scan the results and look for an exact title match

I'm thinking maybe I should save the movie/tv show name all stripped down

Using .isalnum() is kinda handy as it removes everything but alphanumerics
makes sense, i'll see how it works for me and let you know
Eldorado Wrote:I do have some issues with unicode stuff right now too: some plots being returned contain accented characters etc. that it just doesn't like. There seems to be some hacky code in there right now to try and correct it, so just a heads up in case you get errors of that sort :)

are you using t0mm0.common.net? that should return properly encoded unicode so you don't have to worry (unless the webpage lies about the character set in use as we discovered before, i still need to commit the fix/workaround for that). if it doesn't then let me know and it'll be a useful test case...

t0mm0
k_zeon Offline
Senior Member
Posts: 193
Joined: Aug 2011
Reputation: 0
Post: #26
Hi Eldorado

I'm just about to have a go with your module and wanted to check something.
When you renamed the project to metahandler, I see that the addon.xml file still has <addon id="script.module.metautils" name="metautils" version="0.0.1"

Is this supposed to be changed as well?

tks
Eldorado Offline
Fan
Posts: 497
Joined: May 2009
Reputation: 11
Post: #27
k_zeon Wrote:Hi Eldorado

I'm just about to have a go with your module and wanted to check something.
When you renamed the project to metahandler, I see that the addon.xml file still has <addon id="script.module.metautils" name="metautils" version="0.0.1"

Is this supposed to be changed as well?

tks

Hmmm.. probably should! :)

I have a couple more minor updates to commit, I'll put them up today

edit - just committed my latest changes and updated my first post with some quick how to's
(This post was last modified: 2011-10-02 17:56 by Eldorado.)
k_zeon Offline
Senior Member
Posts: 193
Joined: Aug 2011
Reputation: 0
Post: #28
Hi Eldorado

I have been trying for the last 3 hours but have not been able to get it to work.
I pass the movie name and the year (with the brackets stripped).
I have tried adding infoLabels at different locations but it errors out.

I did have a bit of success when adding img=infoLabels['thumb'] at the end,
and it looked like it was getting info, but after a certain number of items it would error out,
i.e. at item 35 every time.

As you can see below, I am not even passing anything to add_directory, as I am trying one step at a time, but it still errors out.

The URL I am passing is http://tvsearch.co/movies/A/ and it should get all movies starting with A.

Any thoughts?

Code:
html = net.http_GET(url).content
match = re.compile('<li class="searchList"><a href="(.+?)">(.+?)</a> <span>(.+?)</span>').findall(html)

metaget = metahandlers.MetaData()

for url, name, sYear in match:
    # strip the brackets from the year and the colon from the name
    sYear = sYear.replace('(', '')
    sYear = sYear.replace(')', '')
    name = name.replace(':', '')

    meta = metaget.get_meta('', 'movie', name, sYear)
    infoLabels = create_infolabels(meta, name)
    addon.add_directory({'mode' : 'GetMovieSource', 'url' : url}, name, total_items=len(match))
Eldorado Offline
Fan
Posts: 497
Joined: May 2009
Reputation: 11
Post: #29
As of now you can't add metadata to directories using t0mm0's common library, though he might be working on adding that.

You can only add it to video items, using his add_video_item() method.

Are you getting errors when retrieving metadata? I would like to see any errors, so can you post a log? I need to add more logging as well..

To see a working example, check out the 'meta' branch of Project Free TV in my repository: https://github.com/Eldorados/eldorado-xb.../tree/meta


Also, you can use SQLiteSpy to take a look inside the cache database, which is created at .../userdata/addon_data/script.module.metahandler/meta_cache/video_cache.db

Do a select on the table 'movie_meta' and you can see what has been found. A good result will have the majority of the fields filled in, as well as the imdb_id and tmdb_id
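
If you would rather check from Python than from SQLiteSpy, something along these lines should do the same select. This is only a sketch: the table name and the imdb_id/tmdb_id columns are from this post, but the other columns, the sample row, and the fill_ratio helper are made up, and an in-memory database stands in for video_cache.db.

```python
import sqlite3


def fetch_movie_rows(conn):
    """Return every row of movie_meta with columns addressable by name."""
    conn.row_factory = sqlite3.Row
    return conn.execute('SELECT * FROM movie_meta').fetchall()


def fill_ratio(row):
    """Fraction of columns in a row that hold a non-empty value."""
    values = [row[key] for key in row.keys()]
    filled = [value for value in values if value not in (None, '')]
    return len(filled) / float(len(values))


# in-memory stand-in; point sqlite3.connect() at video_cache.db instead
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movie_meta '
             '(imdb_id TEXT, tmdb_id TEXT, name TEXT, plot TEXT)')
conn.execute("INSERT INTO movie_meta VALUES "
             "('tt0000001', '1', 'Example Movie', '')")

rows = fetch_movie_rows(conn)
ratios = [(row['name'], fill_ratio(row)) for row in rows]
conn.close()
```

A row with a low ratio, or with an empty imdb_id, is a likely sign of a poor match.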
(This post was last modified: 2011-10-04 03:39 by Eldorado.)
k_zeon Offline
Senior Member
Posts: 193
Joined: Aug 2011
Reputation: 0
Post: #30
I just checked the db and yes, the fields have been filled with info.

I still don't understand why it stops.

When scraping the A's it gets to 35 every time.
When scraping the Numbers it always stops at 3, i.e. the one after 1. Mai

I think it is to do with UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 32: ordinal not in range(128)
If I comment out the metahandler bits it loads the menus fine.


Code:
21:52:57 T:118956032 M:100352000  NOTICE: SQL SELECT
21:52:57 T:118956032 M:100352000  NOTICE: SELECT * FROM movie_meta WHERE imdb_id = 'tt1489877'
21:52:57 T:118956032 M:100352000  NOTICE: MATCHED ROW
21:52:57 T:118956032 M:100352000  NOTICE: None
21:52:57 T:118956032 M:100352000   DEBUG: TVSource: adding dir: 1 a Minute  (2010) - plugin://plugin.video.tvsource/?url=http%3A%2F%2Ftvsearch.co%2Fmovies%2Fwatch%2FP8PXD8MV&mode=GetMovieSource
21:52:57 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.search/en/json/b91e899ce561dd19695340c3b26e0a02/1 Day
21:52:59 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.getInfo/en/json/b91e899ce561dd19695340c3b26e0a02/30262
21:53:01 T:118956032 M:100352000  NOTICE: SQL SELECT
21:53:01 T:118956032 M:100352000  NOTICE: SELECT * FROM movie_meta WHERE imdb_id = 'tt1268158'
21:53:01 T:118956032 M:100352000  NOTICE: MATCHED ROW
21:53:01 T:118956032 M:100352000  NOTICE: None
21:53:01 T:118956032 M:100352000   DEBUG: TVSource: adding dir: 1 Day  (2009) - plugin://plugin.video.tvsource/?url=http%3A%2F%2Ftvsearch.co%2Fmovies%2Fwatch%2FR7T7SXRZ&mode=GetMovieSource
21:53:01 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.search/en/json/b91e899ce561dd19695340c3b26e0a02/1. Mai
21:53:02 T:118956032 M:100352000  NOTICE: http://api.themoviedb.org/2.1/Movie.getInfo/en/json/b91e899ce561dd19695340c3b26e0a02/8209
21:53:03 T:118956032 M:100352000  NOTICE: SQL SELECT
21:53:03 T:118956032 M:100352000  NOTICE: SELECT * FROM movie_meta WHERE imdb_id = 'tt1020932'
21:53:03 T:118956032 M:100352000  NOTICE: MATCHED ROW
21:53:03 T:118956032 M:100352000  NOTICE: None
21:53:03 T:118956032 M:100352000    INFO: -->Python script returned the following error<--
21:53:03 T:118956032 M:100352000   ERROR: Error Type: <type 'exceptions.UnicodeEncodeError'>
21:53:03 T:118956032 M:100352000   ERROR: Error Contents: 'ascii' codec can't encode character u'\xd6' in position 32: ordinal not in range(128)
21:53:03 T:118956032 M:100352000   ERROR: Traceback (most recent call last):
                                              File "/var/mobile/Library/Preferences/XBMC/addons/plugin.video.tvsource/default.py", line 207, in <module>
                                                infoLabels = create_infolabels(meta, name)
                                              File "/var/mobile/Library/Preferences/XBMC/addons/plugin.video.tvsource/default.py", line 34, in create_infolabels
                                                infoLabels['plot'] = str(meta['plot'])
                                            UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 32: ordinal not in range(128)
21:53:03 T:118956032 M:100352000    INFO: -->End of Python script error report<--
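
The traceback points at str(meta['plot']). Under the Python 2 that XBMC ships, calling str() on a unicode value encodes it with the ascii codec, which fails on a character like u'\xd6'. A possible workaround, sketched here with a hypothetical helper called safe_text, is to encode explicitly as UTF-8 instead. Whether the rest of the add-on wants byte strings or unicode at that point is an assumption to check.

```python
# -*- coding: utf-8 -*-
TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def safe_text(value, encoding='utf-8'):
    """Encode text explicitly instead of relying on the ascii default."""
    if value is None:
        return b''
    if isinstance(value, TEXT_TYPE):
        return value.encode(encoding)
    return value


plot = u'\xd6sterreich'  # made-up plot text containing the failing character

# the ascii codec reproduces the error seen in the log
try:
    plot.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as err:
    ascii_error = str(err)

encoded = safe_text(plot)
```

In create_infolabels this would mean replacing str(meta['plot']) with something like safe_text(meta['plot']).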

tks
(This post was last modified: 2011-10-04 09:22 by k_zeon.)