py+soup basic parse question - Printable Version

py+soup basic parse question - wellspokenman - 2012-09-04

Hi,
I'm pretty comfortable with bash, cut, grep and awk, but doing the same stuff in py+soup is doing my head in. So far I can fetch the 'desc' class from an IMDB watchlist, but I can't turn it into 'keys' or 'variables' that I can do anything useful with. Here is my basic tutorial script:

Code:
from bs4 import BeautifulSoup
from mechanize import Browser
import urllib2
import re

url="http://www.imdb.com/user/ur35645275/watchlist"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
movies=soup.findAll('div',{'class':'desc'})

for eachmovie in movies:
#    print eachmovie['href']+","+eachmovie.string
    print eachmovie

This will output something like:

Quote:
<div class="desc">
<a href="/title/tt0187078/">Gone in Sixty Seconds</a>
</div>
<div class="desc">
<a href="/title/tt0477472/">Solo</a>
</div>
<div class="desc">
<a href="/title/tt0086250/">Scarface</a>
</div>
<div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>
</div>

Which is cool and all, but I want clean strings I can feed into xbmc.

Can anyone help me carve this text up into something useful?

To be more specific, I want: the IMDB ID (the tt string, ^tt[0-9]{7} as a regex), the IMDB URL (/title/id/), the title and of course the thumbnail, i.e. (imdbid,url,title,thumbnail).
I have imdbpy, which is great for fetching stuff once I have a name or an ID, but here I just want that info for a given watchlist.
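
For the ID part of that list, a minimal sketch using only the standard `re` module is below. The helper name `imdb_id_from_href` is made up for illustration. Note that the `^` anchor from the regex above is dropped here, because the href begins with `/title/` rather than with `tt`.

```python
import re

# IMDB title IDs in the quoted hrefs look like "tt" followed by seven digits.
TITLE_ID = re.compile(r"tt[0-9]{7}")

def imdb_id_from_href(href):
    """Return the tt-id found in an href such as '/title/tt0187078/', or None."""
    match = TITLE_ID.search(href)  # search, not match: the id is mid-string
    return match.group(0) if match else None

print(imdb_id_from_href("/title/tt0187078/"))
```

The function returns None instead of raising when the href holds no title ID, which keeps a loop over many links from stopping on one odd entry.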

RE: py+soup basic parse question - Beenje - 2012-09-04

Check eachmovie.a['href'] and eachmovie.a.string (or eachmovie.text)

RE: py+soup basic parse question - wellspokenman - 2012-09-05

.text works and returns the title, very neatly.

.a.string produces an error:
print "HREF=" + eachmovie.a.string
AttributeError: 'NoneType' object has no attribute 'string'

.a['href'] fails as well:
print "HREF=" + eachmovie.a['href']
TypeError: 'NoneType' object is not subscriptable
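
Both tracebacks mean that `eachmovie.a` evaluated to None, i.e. at least one of the matched divs had no `<a>` tag inside it. A rough sketch of guarding against that follows; the markup is invented for the example (the second div deliberately has no link) and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second "desc" div has no <a> inside it.
html = """
<div class="desc"><a href="/title/tt0477472/">Solo</a></div>
<div class="desc">No link here</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for eachmovie in soup.find_all("div", {"class": "desc"}):
    anchor = eachmovie.a  # None when the div holds no <a> tag
    if anchor is None:
        continue  # skip it instead of hitting the NoneType errors above
    links.append((anchor["href"], anchor.string))

print(links)
```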

RE: py+soup basic parse question - wellspokenman - 2012-09-05

tried this too:
print "HREF=" + eachmovie.index

which returns:
<bound method Tag.index of <div class="desc">
<a href="/title/tt0072890/">Dog Day Afternoon</a>
</div>>

It's as if there are no tags inside my selection... will keep at it.

RE: py+soup basic parse question - wellspokenman - 2012-09-05

got it:

print eachmovie.text
print eachmovie.a['href']
print eachmovie.img

after changing the soup find to:

movies=soup.findAll('div',{'class':'list_item'})

returns:
Dog Day Afternoon

/title/tt0072890/
<img class="loadlate hidden zero-z-index" height="209" loadlate="http://ia.media-imdb.com/images/M/MV5BMTQyNjQ5NjczM15BMl5BanBnXkFtZTYwNDA4MTk4._V1._SY209_CR1,0,140,209_.jpg" src="http://i.media-imdb.com/images/SFaa265aa19162c9e4f3781fbae59f856d/nopicture/medium/film.png" width="140"/>
htpc@xbmc:~/scripts/wip$
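
Putting the pieces together, here is a sketch of collecting (imdbid,url,title,thumbnail) per entry. The markup is a hypothetical, trimmed-down `list_item` with placeholder image URLs, so the real page may need different lookups. In the `<img>` output above, `src` points at a "nopicture" placeholder while `loadlate` carries the poster URL, so the sketch prefers `loadlate` and falls back to `src`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified watchlist entry; image URLs are placeholders.
html = """
<div class="list_item">
  <a href="/title/tt0072890/"><img loadlate="http://example.com/poster.jpg" src="http://example.com/placeholder.png"/></a>
  <div class="desc"><a href="/title/tt0072890/">Dog Day Afternoon</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for item in soup.find_all("div", {"class": "list_item"}):
    url = item.a["href"]                                  # first link in the entry
    imdbid = re.search(r"tt[0-9]{7}", url).group(0)       # tt-id inside the href
    title = item.find("div", {"class": "desc"}).a.string  # link text in the desc div
    img = item.img
    thumbnail = img.get("loadlate") or img.get("src")     # prefer the lazy-load URL
    records.append((imdbid, url, str(title), thumbnail))

for record in records:
    print(",".join(record))
```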

RE: py+soup basic parse question - divingmule - 2012-09-05

Code:
# BeautifulSoup 3 style call; `link` is assumed to hold the fetched page HTML
soup = BeautifulSoup(link, convertEntities=BeautifulSoup.HTML_ENTITIES)
items = soup('div', attrs={'class' : "list_item grid"})
for i in items:
    thumb = i.a.img['src']
    name = i('a')[1].string
    href = i('a')[1]['href']
    print(name, href, thumb)
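
One difference between the two searches in this thread is the class string: `list_item` versus `"list_item grid"`. Under bs4 (the snippet above appears to use the older BeautifulSoup 3 call style), a single class name matches any tag that carries it, while a space-separated string matches only that exact attribute value. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two different class combinations.
html = """
<div class="list_item grid">A</div>
<div class="list_item compact">B</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Single class name: matches both divs, since each carries "list_item".
by_single = [str(d.string) for d in soup.find_all("div", {"class": "list_item"})]

# Full string: matches only the div whose class attribute is exactly this.
by_exact = [str(d.string) for d in soup.find_all("div", {"class": "list_item grid"})]

print(by_single)
print(by_exact)
```

Searching on the single class name is the more forgiving choice if the page ever changes the layout class next to `list_item`.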