py+soup basic parse question
Kodi Community Forum (https://forum.kodi.tv), Forum: Development, Add-ons
Thread: py+soup basic parse question (/showthread.php?tid=139755)
py+soup basic parse question - wellspokenman - 2012-09-04

Hi, I'm pretty comfortable with bash, cut, grep and awk, but doing the same stuff in py+soup is doing my head in. So far I can fetch the 'desc' class from an IMDB watch list, but I can't turn it into 'keys' or 'variables' that I can do anything useful with. Here is my basic tutorial script:

Code:
from bs4 import BeautifulSoup

This will output something like:

Quote:
<div class="desc">

Which is cool and all, but I want clean strings I can feed into xbmc. Can anyone help me carve this text up into something useful?

------
To be more specific, I want: the IMDB ID (the tt string; ^tt[0-9]{7} as a regex), the IMDB URL (/title/id/), the title and of course the thumbnail (imdbid, url, title, thumbnail). I have imdbpy, which is great for fetching stuff once I have a name or an ID, but here I just want that info for a given watchlist.

RE: py+soup basic parse question - Beenje - 2012-09-04

Check eachmovie.a['href'] and eachmovie.a.string (or eachmovie.text)

RE: py+soup basic parse question - wellspokenman - 2012-09-05

.text works and returns the title, very neatly.

.a.string produces an error:
print "HREF=" + eachmovie.a.string
AttributeError: 'NoneType' object has no attribute 'string'

.a['href'] returns:
print "HREF=" + eachmovie.a['href']
TypeError: 'NoneType' object is not subscriptable

RE: py+soup basic parse question - wellspokenman - 2012-09-05

tried this too:
print "HREF=" + eachmovie.index
which returns:
<bound method Tag.index of <div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>
</div>>
it's as if there are no tags inside my selection....will keep at it.
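The IMDB ID asked for in the first post can be pulled out of an href such as /title/tt0072890/ with the standard re module. The sketch below is only an illustration: the helper name imdb_id_from_href is made up for this example, and it assumes IDs are "tt" plus exactly seven digits, as the first post states. It uses search() without the ^ anchor because the href starts with /title/ rather than with the ID, and it returns None instead of raising when nothing matches.

```python
import re

# Assumption: an IMDB ID is "tt" followed by 7 digits, as in the first post.
IMDB_ID = re.compile(r"tt[0-9]{7}")


def imdb_id_from_href(href):
    """Return the IMDB ID found in href, or None if there is none.

    Hypothetical helper for illustration; not part of any library.
    """
    if not href:
        return None
    match = IMDB_ID.search(href)  # search(), not match(): the ID is mid-string
    return match.group(0) if match else None


print(imdb_id_from_href("/title/tt0072890/"))
```

Returning None for a missing match keeps the caller's loop from blowing up the way eachmovie.a['href'] did above when there was no link to read.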
RE: py+soup basic parse question - wellspokenman - 2012-09-05

got it:
print eachmovie.text
print eachmovie.a['href']
print eachmovie.img

after changing the soup find to:
movies=soup.findAll('div',{'class':'list_item'})

returns:
Dog Day Afternoon
/title/tt0072890/
<img class="loadlate hidden zero-z-index" height="209" loadlate="http://ia.media-imdb.com/images/M/MV5BMTQyNjQ5NjczM15BMl5BanBnXkFtZTYwNDA4MTk4._V1._SY209_CR1,0,140,209_.jpg" src="http://i.media-imdb.com/images/SFaa265aa19162c9e4f3781fbae59f856d/nopicture/medium/film.png" width="140"/>

RE: py+soup basic parse question - divingmule - 2012-09-05

Code:
soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES)
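Putting the pieces from this thread together, one possible way to get the four values from the first post (imdbid, url, title, thumbnail) out of each list_item div is sketched below. It is a rough illustration rather than a tested scraper: the SAMPLE markup is invented and only loosely modelled on the output pasted above, the function name parse_watchlist is hypothetical, and it is written for bs4 on Python 3 (print as a function). It assumes the real poster URL sits in the img tag's loadlate attribute, with src holding a placeholder, as the pasted output suggests.

```python
import re
from bs4 import BeautifulSoup

# Invented sample markup, loosely modelled on the output shown in the thread.
SAMPLE = """
<div class="list_item">
  <img loadlate="http://example.invalid/poster.jpg"
       src="http://example.invalid/placeholder.png"/>
  <a href="/title/tt0072890/">Dog Day Afternoon</a>
</div>
<div class="list_item">no link in this one</div>
"""


def parse_watchlist(html):
    """Return one dict per list_item div that contains a link (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find_all("div", {"class": "list_item"}):
        link = item.find("a", href=True)
        if link is None:
            # No <a> inside this div: skip it instead of hitting
            # "'NoneType' object is not subscriptable".
            continue
        match = re.search(r"tt[0-9]{7}", link["href"])
        img = item.find("img")
        thumbnail = None
        if img is not None:
            # Prefer the lazily loaded poster, fall back to src.
            thumbnail = img.get("loadlate") or img.get("src")
        movies.append({
            "imdbid": match.group(0) if match else None,
            "url": link["href"],
            "title": link.get_text(strip=True),
            "thumbnail": thumbnail,
        })
    return movies


for movie in parse_watchlist(SAMPLE):
    print(movie["imdbid"], movie["url"], movie["title"], movie["thumbnail"])
```

Each dict holds plain strings, so they can be handed to whatever builds the xbmc list items. The None check on the link is the part that avoids the AttributeError and TypeError reported earlier in the thread.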