What HTML parser lib should I use?
#1
Question 
I've started some development on parsing svtplay.se again, since the old version of my addon used a deprecated and unofficial XML API.

First I started with BeautifulSoup but failed miserably, since it doesn't handle the new block tags in HTML5 (article etc.).

Then I've had a look at html5lib but the documentation isn't that good, well actually it sucks!

So I'm wondering if anyone has an idea of a good parser that handles HTML5, or should I do it with re?

The only issue I can see with re is that other developers will have a hard time sending me PRs if they have to read my regular expressions.
Reply
#2
Use parseDOM (despite the name, it's not a DOM parser; it uses plain regex, so it can handle any tag).
Reply
#3
Forgot to mention in the first post that I tried it too, but I was unable to get regex selectors to work in it, e.g. ".*svtTab-active.*". But I will try it again...
Reply
#4
I was also a bit confused about how to select both the text and the href from an <a> tag. Do I have to use two separate collections?

text = common.parseDOM(html, "a", attrs = { "class": "foo" })
href = common.parseDOM(html, "a", attrs = { "class": "foo" }, ret = "href")
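
If two collections are needed, one way to keep them together is to pair them by position. This is only a sketch: the two lists below are hard-coded stand-ins for the parseDOM results, pair_links is a hypothetical helper, and it assumes both calls return matches in the same document order (if some links lack an href, the lists would not line up, hence the length check).

```python
# Stand-in data: in the add-on these would be the two parseDOM results.
texts = ["Program A", "Program B"]
hrefs = ["/video/1", "/video/2"]


def pair_links(texts, hrefs):
    """Pair link texts with hrefs; refuse to guess if the lists differ in length."""
    if len(texts) != len(hrefs):
        raise ValueError("text and href lists are not aligned")
    return list(zip(texts, hrefs))


links = pair_links(texts, hrefs)
for text, href in links:
    print(text, href)
```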
Reply
#5
Regex will work, but remember that it goes directly inside another regex, so using .* is a bad idea. Use something stricter like [^"']+
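
A small sketch of why .* misbehaves when it is embedded in a larger pattern. The wrapping pattern in find_classes is a simplified stand-in made up for illustration, not parseDOM's actual regex, and the sample HTML is invented. It uses [^"']* around the class name, a variant of the character class suggested above.

```python
import re

# Invented sample markup with two links on one line.
html = '<a class="svtTab-active" href="/a">A</a><a class="other" href="/b">B</a>'


def find_classes(value_pattern, html):
    # Simplified stand-in for embedding a selector inside a larger regex;
    # this is NOT the pattern parseDOM really builds.
    return re.findall(r'class="(' + value_pattern + r')"', html)


# Greedy: the trailing .* runs past the closing quote into the following tags.
greedy = find_classes(r'.*svtTab-active.*', html)

# Strict: [^"']* cannot cross a quote, so the match stays inside the attribute.
strict = find_classes(r'[^"\']*svtTab-active[^"\']*', html)

print(greedy)
print(strict)
```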
Reply
#6
BeautifulSoup and pyquery are what I used when doing heimdall (my GSoC project).

pyquery is essentially jQuery for Python: it takes a tree and makes it really simple to traverse (CSS selectors), so it relies on something else to parse the tree. IIRC it uses elementtree and has support for (what most consider) broken HTML (unclosed tags etc.).

BeautifulSoup does both the parsing and the traversal of the tree; it supports broken HTML and tries to fix it up as best it can.

Both of these libraries are geared towards any HTML, so they are not HTML5-specific. They should, however, work just fine with HTML5, since it's just better structured than the old HTML.

And the parseDOM examples nilzen gives are essentially how BeautifulSoup works as well.
If you have problems please read this before posting

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.

"Well Im gonna download the code and look at it a bit but I'm certainly not a really good C/C++ programer but I'd help as much as I can, I mostly write in C#."
Reply
#7
topfs2, sadly BeautifulSoup doesn't understand the HTML5 on svt.se: the <article> tags aren't nested inside the containing <div>, since 'article' isn't included in the NESTABLE_BLOCK_TAGS tuple ('blockquote', 'div', 'fieldset', 'ins', 'del').

Perhaps pyquery in combination with html5lib is a nice solution; I will check it out tonight...
Reply
#8
Oh sorry, I missed that you had already tried BeautifulSoup.

You can edit the nestable block tags afaik?
From http://www.mobileread.com/forums/archive...87318.html
Code:
for x in ['article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']:
    BeautifulSoup.NESTABLE_BLOCK_TAGS.append(x)
    BeautifulSoup.RESET_NESTING_TAGS[x]=None
    BeautifulSoup.NESTABLE_TAGS[x]=[]
Reply
#9
But seriously, use parseDOM if it can handle your site. It's faster, requires far less boilerplate and usually gives you a lot more beautiful code than "beautiful"soup. You and your maintainers will thank me later..
Reply
#10
Tried that fix too but it throws an error, at least for the version included in the addon repo: 'tuple' object has no attribute 'append'
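
The error is consistent with NESTABLE_BLOCK_TAGS being a tuple, as quoted earlier in the thread: tuples are immutable and have no append. A minimal sketch of rebinding the attribute instead, using a hypothetical stand-in class (FakeSoup) rather than the real library; whether the bundled BeautifulSoup then nests the new tags correctly is untested here.

```python
class FakeSoup(object):
    # Hypothetical stand-in, NOT the real BeautifulSoup class; it only
    # mirrors the tuple quoted earlier in the thread.
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')


html5_tags = ['article', 'aside', 'header', 'footer', 'nav',
              'figcaption', 'figure', 'section']

# Reproduce the failure: a tuple has no append method.
try:
    FakeSoup.NESTABLE_BLOCK_TAGS.append('article')
    append_error = None
except AttributeError as exc:
    append_error = str(exc)

# Rebind the class attribute to a new, longer tuple instead of mutating it.
FakeSoup.NESTABLE_BLOCK_TAGS = FakeSoup.NESTABLE_BLOCK_TAGS + tuple(html5_tags)

print(append_error)
print(FakeSoup.NESTABLE_BLOCK_TAGS)
```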
Reply
#11
What's wrong with just using regex on the page source?
Reply
#12
Maintainability I would say
Reply
#13
BeautifulSoup 4? Did you try that?
Doc: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Download: http://www.crummy.com/software/Beautiful.../download/
Reply
#14
Maybe this is off-topic, but...

Before you consider scraping HTML, double-check that the website/web service really has no JSON or XML API. Most bigger services have one. For a quick check, just have a look in the Apple App Store or Google Play. If the website has a smartphone app, it almost certainly has a (maybe hidden) API.
Reply
