Script to list links inside <div class=""> beginner help
#1
I am just a beginner and would appreciate it if somebody could help with generating a list. I am editing the Macedoniaondemand Python script (big thank you to the author). I have made a few changes and they are working, but I cannot figure out how to display only the series that are inside <div class="all_shows">.

I have been reading the wiki to figure it out, but no luck.

<div class="all_shows">

<ul data-letter="A">

<li><a href="http://www.hrt.hr/enz/abeceda-zdravlja/">Abeceda zdravlja</a></li>

<li><a href="http://www.hrt.hr/enz/alpe-dunav-jadran/">Alpe Dunav Jadran</a></li>

</ul>

This is the relevant Python script:


def createHRTSeriesListing():
  url='http://www.hrt.hr/enz'
  req = urllib2.Request(url)
  req.add_header('User-Agent', user_agent)
  response = urllib2.urlopen(req)
  link=response.read()
  response.close()
  match=re.compile('<li><a href="(.+?)">(.+?)</a></li>').findall(link)
  return match


The line match=re.compile('<li><a href="(.+?)">(.+?)</a></li>').findall(link) lists every link on the page, which is exactly what it is written to do, but I want it to list only the links inside <div class="all_shows">.

Thank you
#2
Personally, I wouldn't use regex to scrape URLs from a website. BeautifulSoup would be my choice.

I'm happy to post an example of how to get the links you want but won't be able to do this until later today.
BBC Live Football Scores: Football scores notifications service.
script.squeezeinfo: shows what's playing on your Logitech Media Server
#3
I would really appreciate the example. I have come across BeautifulSoup in my Python reading, and I will do some more reading on it.
#4
OK, I'd do something like this:
Code:
from BeautifulSoup import BeautifulSoup
import urllib2

url='http://www.hrt.hr/enz'
req = urllib2.Request(url)
req.add_header('User-Agent', user_agent)
response = urllib2.urlopen(req)

page=BeautifulSoup(response.read())
response.close()

links=page.find("div", {"class": "all_shows"}).findAll("a")

for link in links:
  print link.get("href")

You should also see this thread about which version of BeautifulSoup to use and how to import it.

I assume you've defined "user_agent" elsewhere; otherwise that line will, I think, fail.

Apologies, I can't test the code above but I think it should work!
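
On the user_agent point, here is a minimal sketch of what I mean. It only builds the request and makes no network call, so it is easy to check. The user_agent string is a made-up placeholder (the real script presumably defines its own), and the try/except import is just so the same lines run on Python 2 (urllib2, as in this thread) or Python 3:

```python
try:
    from urllib2 import Request            # Python 2, as used in this thread
except ImportError:
    from urllib.request import Request     # Python 3 equivalent

# Hypothetical value for illustration; the add-on defines its own user_agent.
user_agent = "Mozilla/5.0 (X11; Linux x86_64)"

url = "http://www.hrt.hr/enz"
req = Request(url)
req.add_header("User-Agent", user_agent)

# Request stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

If user_agent is not defined before the add_header line, Python raises a NameError there.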
#5
Thank you, that worked in the debug logs, but nothing showed on the screen. There were two further errors:

line 1277, PROCESS_PAGE(page, url, name)

line 1165, for link, title in listing:


I have emailed the Macedoniaondemand author, so hopefully he can correct the error for the Croatian part.
#6
My code was just an example to show you how to get the links using BeautifulSoup and print them (to the xbmc log). I've no idea how the script you're using intends to use the links, but hopefully you'll be able to sort this out.
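
I can't tell the actual error from those two lines, but a loop like "for link, title in listing:" suggests the script expects a list of (link, title) pairs. A function that only prints returns None, and looping over that fails. A small sketch of the difference, where listing_ok, listing_bad and print_only are made-up names for illustration and I'm guessing at how the add-on consumes the list:

```python
# Each item must unpack into exactly two values for "for link, title in ..." to work.
listing_ok = [
    ("http://www.hrt.hr/enz/abeceda-zdravlja/", "Abeceda zdravlja"),
    ("http://www.hrt.hr/enz/alpe-dunav-jadran/", "Alpe Dunav Jadran"),
]

titles = []
for link, title in listing_ok:
    titles.append(title)


def print_only():
    # Prints a link but has no return statement, so it returns None.
    print("http://www.hrt.hr/enz/abeceda-zdravlja/")


listing_bad = print_only()

try:
    for link, title in listing_bad:
        pass
    error = None
except TypeError as exc:
    # Iterating over None raises TypeError.
    error = exc
```

So whatever replaces the regex needs to return the same shape the regex gave: a list of two-item tuples.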

Good luck.
#7
Finally figured it out: BeautifulSoup wasn't outputting a string, so after converting the result to a string it worked.

def createHRTSeriesListing():
  url='http://www.hrt.hr/enz'
  req = urllib2.Request(url)
  req.add_header('User-Agent', user_agent)
  response = urllib2.urlopen(req)
  page=BeautifulSoup(response.read())
  response.close()
  link=page.find("div", {"class": "all_shows"}).findAll("li")
  str1 = ''.join(str(e) for e in link)

  match=re.compile('<li><a href="(.+?)">(.+?)</a></li>').findall(str1)
  return match
#8
You've gone back to using regex at the end, which is what I was trying to avoid. Sticking with BeautifulSoup, you can do this:
Code:
def createHRTSeriesListing():
  url='http://www.hrt.hr/enz'
  req = urllib2.Request(url)
  req.add_header('User-Agent', user_agent)
  response = urllib2.urlopen(req)
  page=BeautifulSoup(response.read())
  response.close()
  links=page.find("div", {"class": "all_shows"}).findAll("li")
  match = [(link.find("a").get("href"), link.text) for link in links]
  return match

That will return a list of (link, text) tuples, the same as the output from your regex code.
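
If you want to check the (link, text) shape without installing BeautifulSoup or hitting the site, here is a rough sketch that uses only the standard library's HTML parser instead (so it is not the BeautifulSoup approach above, just the same idea of scoping to the div). It runs against the HTML snippet from post #1. AllShowsParser, list_all_shows and the two example.com links outside the div are my own inventions for illustration:

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3


class AllShowsParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside <div class="all_shows">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.div_depth = 0    # greater than 0 while inside the target div
        self.href = None      # href of the <a> currently open, if any
        self.text = []        # text pieces seen inside that <a>
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1      # nested div inside the target
            elif "all_shows" in (attrs.get("class") or "").split():
                self.div_depth = 1       # entering the target div
        elif tag == "a" and self.div_depth:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a" and self.href is not None:
            self.results.append((self.href, "".join(self.text).strip()))
            self.href = None


# Snippet from post #1, plus two made-up links outside the div.
SAMPLE = """
<ul><li><a href="http://example.com/outside-before">Outside</a></li></ul>
<div class="all_shows">
<ul data-letter="A">
<li><a href="http://www.hrt.hr/enz/abeceda-zdravlja/">Abeceda zdravlja</a></li>
<li><a href="http://www.hrt.hr/enz/alpe-dunav-jadran/">Alpe Dunav Jadran</a></li>
</ul>
</div>
<ul><li><a href="http://example.com/outside-after">Outside</a></li></ul>
"""


def list_all_shows(html):
    parser = AllShowsParser()
    parser.feed(html)
    parser.close()
    return parser.results


shows = list_all_shows(SAMPLE)
for link, title in shows:
    print(title + " -> " + link)
```

Only the two links inside the all_shows div end up in the list; the ones before and after it are skipped.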
#9
This is exactly what I've been looking for. I assume this code goes in the addon.py file of your plugin? Would you mind posting the complete code for the file, including the import section? Thank you.