HTML table parsing
#1
Hi all,

I'm really struggling to code something that parses HTML, extracts the tables, and then redisplays them in a Python screen.

Can anyone help? :-)

If it helps, I've mirrored the actual page I'm trying to parse below:

http://www.dragon256.plus.com/timer.html

Quote: <table width='100%' cellpadding='1'>
<tr>
<td class='grid' align='left'>name</td>
<td class='grid' align='left'>channel</td>
<td class='grid' align='left'>date</td>
<td class='grid' align='left'>start</td>
<td class='grid' align='left'>stop</td>
<td class='grid' align='left'>delete</td>
</tr>

<tr>
<td class='normal' align='left'><a href='javascript:ontimer (1)'>bbc news</a></td>
<td class='normal' align='left'>bbc one</td>
<td class='normal' align='left'>25-06-2005</td>
<td class='normal' align='left'>17:00</td>
<td class='normal' align='left'>17:10</td>
<td class='normal' align='middle'><input type='checkbox' onclick='ondelete (1)'></td>
</tr>
<tr>
<td class='normal' align='left'><a href='javascript:ontimer (2)'>bigbrother</a></td>
<td class='normal' align='left'>channel 4</td>

<td class='normal' align='left'>25-06-2005</td>
<td class='normal' align='left'>21:10</td>
<td class='normal' align='left'>22:05</td>
<td class='normal' align='middle'><input type='checkbox' onclick='ondelete (2)'></td>
</tr>
<tr>
<td class='normal' align='left'><a href='javascript:ontimer (3)'>bbc news</a></td>
<td class='normal' align='left'>bbc one</td>
<td class='normal' align='left'>25-06-2005</td>
<td class='normal' align='left'>22:45</td>

<td class='normal' align='left'>23:05</td>
<td class='normal' align='middle'><input type='checkbox' onclick='ondelete (3)'></td>
</tr>
</table>
#2
Just did some quick coding before going to bed. It's not exactly good, but hopefully it gives you an idea of how to solve it:

Quote:
# Python 2 syntax (urllib.urlopen, print statement)
import urllib, re

f = urllib.urlopen("http://www.dragon256.plus.com/timer.html")
data = f.read()

# grab the contents of each table row
query = re.compile("<tr>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
lists = re.findall(query, data)

# grab the contents of each left-aligned cell within a row
query2 = re.compile("left'>(.*?)</", re.IGNORECASE | re.DOTALL)

program = []
for x in lists:
    lists2 = re.findall(query2, x)
    program.append(lists2)

print program

This produces:

Quote:
[['name', 'channel', 'date', 'start', 'stop', 'delete'],
["<a href='javascript:ontimer (1)'>glastonbury 2005", 'bbc three', '25-06-2005', '19:00', '21:00'],
["<a href='javascript:ontimer (2)'>bigbrother", 'channel 4', '25-06-2005', '21:10', '22:05'],
["<a href='javascript:ontimer (3)'>glastonbury 2005", 'bbc three', '26-06-2005', '19:00', '21:00'],
["<a href='javascript:ontimer (4)'>glastonbury 2005", 'bbc three', '26-06-2005', '21:00', '02:00']]

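As you can see, it's not perfect: the name cells still contain the opening <a> tag, and the checkbox column is missing from the data rows because the second regex only matches cells with align='left'. You'll need to adjust the regex to fix that. That's all the time I was willing to spend on it for now... good luck!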
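
For what it's worth, here is one way the regex could be adjusted. Treat it as a rough sketch, not a finished solution. It assumes Python 3, and it runs against a couple of rows copied from the table in the first post instead of fetching the page, so it works offline. The names parse_table, SAMPLE_HTML, ROW_RE, CELL_RE and TAG_RE are made up for this example. The idea is to match every <td> regardless of its attributes, then strip any tags left inside the cell.

```python
import re

# Two rows copied from the table quoted in the first post.
SAMPLE_HTML = """
<table width='100%' cellpadding='1'>
<tr>
<td class='grid' align='left'>name</td>
<td class='grid' align='left'>channel</td>
<td class='grid' align='left'>date</td>
<td class='grid' align='left'>start</td>
<td class='grid' align='left'>stop</td>
<td class='grid' align='left'>delete</td>
</tr>
<tr>
<td class='normal' align='left'><a href='javascript:ontimer (1)'>bbc news</a></td>
<td class='normal' align='left'>bbc one</td>
<td class='normal' align='left'>25-06-2005</td>
<td class='normal' align='left'>17:00</td>
<td class='normal' align='left'>17:10</td>
<td class='normal' align='middle'><input type='checkbox' onclick='ondelete (1)'></td>
</tr>
</table>
"""

# Match whole rows and whole cells, whatever attributes they carry.
ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
# Match any tag remaining inside a cell, such as <a ...>, </a> or <input ...>.
TAG_RE = re.compile(r"<[^>]+>")


def parse_table(html):
    """Return a list of rows, each a list of cell texts with tags removed."""
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = [TAG_RE.sub("", cell).strip() for cell in CELL_RE.findall(row_html)]
        rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in parse_table(SAMPLE_HTML):
        print(row)
```

With this, the name cell comes out as plain text, and the checkbox cell shows up as an empty string, so every row has six columns. Regexes like these only cope with simple, flat tables such as this one. For nested tables or messier markup, a real HTML parser would probably be the safer choice.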
xbmcscripts.com administrator
#3
Thanks! I'll have a go with the code above and see how it goes.