Posts: 62
Joined: Dec 2009
Reputation: 2
I forgot to mention in the first post that I tried it too, but I was unable to get regex selectors to work in it, e.g. ".*svtTab-active.*". But I will try it again...
Posts: 62
Joined: Dec 2009
Reputation: 2
I was also a bit confused about how to select both the text and the href from an <a> tag. Do I have to use two separate collections?
text = common.parseDOM(html, "a", attrs = { "class": "foo" })
href = common.parseDOM(html, "a", attrs = { "class": "foo" }, ret = "href")
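For comparison, here is a sketch of getting both values in a single pass without parseDOM, using only Python's standard-library HTMLParser. The class name LinkCollector and the sample HTML are made up for illustration, and the code assumes Python 3 (the html.parser module). It is not how parseDOM works internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags carrying a given CSS class.

    Hypothetical helper for illustration only; not part of parseDOM.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []       # finished (href, text) pairs
        self._inside = False  # True while inside a matching <a>
        self._href = None     # href of the <a> currently open
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._text).strip()))
            self._inside = False


# Made-up sample markup: two links with class "foo", one without.
html = (
    '<a class="foo" href="/one">First</a> '
    '<a class="bar" href="/x">Skip</a> '
    '<a class="foo big" href="/two">Second</a>'
)

collector = LinkCollector("foo")
collector.feed(html)
pairs = collector.links  # list of (href, text) tuples in document order
```

Each entry in pairs keeps the href and the text of the same tag together, so the two values cannot drift out of step the way two separately built lists might.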
Posts: 1,265
Joined: Oct 2009
Reputation: 29
takoi
Team-Kodi Member
Regex will work, but remember that it goes directly inside another regex, so using .* is a bad idea. Use something stricter, like [^"']+
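To see why .* is risky when a pattern gets embedded in a larger regex, here is a small sketch. The build() helper and its tag pattern are a simplified stand-in made up for illustration, not parseDOM's actual internal regex, and the sample HTML is invented.

```python
import re

# Invented sample markup: one active tab followed by another list item.
html = (
    '<li class="svtTab-active first"><a href="/a">A</a></li>'
    '<li class="other">B</li>'
)


def build(value_pattern):
    # Hypothetical stand-in: embed a caller-supplied pattern for the
    # attribute value inside a larger tag-matching regex.
    return re.compile(r'<li\s[^>]*class="(' + value_pattern + r')"')


# Greedy .* is free to run past the closing quote of the attribute.
greedy = build(r'.*svtTab-active.*').search(html).group(1)

# A character class that excludes quotes cannot leave the attribute value.
strict = build(r'''[^"']*svtTab-active[^"']*''').search(html).group(1)

print(repr(greedy))
print(repr(strict))
```

With the greedy pattern the captured "value" swallows markup up to the last double quote in the document, while the stricter pattern stops at the end of the class attribute.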
Posts: 4,549
Joined: Dec 2007
Reputation: 17
topfs2
Team-Kodi Developer
BeautifulSoup and pyquery are what I used when doing heimdall (my GSoC project).
pyquery is essentially jQuery for Python: it takes a tree and makes it really simple to traverse (CSS selectors), so it relies on something else to parse the tree. IIRC it uses elementtree and supports (what most consider) broken HTML (unclosed tags etc.).
BeautifulSoup does both the traversal and the parsing of the tree; it supports broken HTML and tries to fix it up as best it can.
Both of these libraries are geared towards any HTML, so they are not HTML5-specific. They should, however, work just fine with HTML5, since it's just better structured than older HTML.
And the parseDOM examples nilzen gives are essentially how BeautifulSoup works as well.
Posts: 62
Joined: Dec 2009
Reputation: 2
topfs2, sadly BeautifulSoup doesn't understand the HTML5 on svt.se: the <article> tags aren't nested inside the containing <div>, since article isn't included in the NESTABLE_BLOCK_TAGS tuple ('blockquote', 'div', 'fieldset', 'ins', 'del').
Perhaps pyquery in combination with html5lib is a nice solution; I will check it out tonight...
Posts: 1,265
Joined: Oct 2009
Reputation: 29
takoi
Team-Kodi Member
2012-10-02, 11:27
(This post was last modified: 2012-10-02, 11:28 by takoi.)
But seriously, use parseDOM if it can handle your site. It's faster, requires far less boilerplate and usually gives you much more beautiful code than "beautiful"soup. You and your maintainers will thank me later.
Posts: 62
Joined: Dec 2009
Reputation: 2
I tried that fix too, but it throws an error, at least for the version included in the addon repo: 'tuple' object has no attribute 'append'
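That error is ordinary Python behaviour: tuples are immutable, so they have no append method. The sketch below reproduces it on a stand-in copy of the tuple quoted earlier in the thread and shows how a new, extended tuple can be built by concatenation. Whether BeautifulSoup would actually pick up such a replacement is an assumption not verified here; it may derive other lookup tables from the tuple when the class is defined.

```python
# Stand-in copy of the tuple quoted in the thread; this is NOT imported
# from BeautifulSoup, it only illustrates the tuple behaviour.
NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

try:
    # Tuples are immutable, so this raises AttributeError.
    NESTABLE_BLOCK_TAGS.append('article')
    error = None
except AttributeError as exc:
    error = str(exc)

# Concatenation creates a new tuple instead of mutating the old one.
extended = NESTABLE_BLOCK_TAGS + ('article',)

print(error)
print(extended)
```

Note the trailing comma in ('article',): without it the parentheses would just wrap a string, and adding a string to a tuple raises a TypeError.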
Posts: 793
Joined: Oct 2010
Reputation: 17
What's wrong with just using regex on the page source?
Posts: 62
Joined: Dec 2009
Reputation: 2
Maintainability, I would say.
Posts: 1,299
Joined: Jul 2009
Reputation: 59
sphere
Retired Team-Kodi Member
Maybe this is offtopic, but...
Before you consider scraping HTML, double-check that the website/web service really has no JSON or XML API. Most bigger services have one. For a quick check, just have a look in the Apple App Store or Google Play. If the website has a smartphone app, it definitely has a (maybe hidden) API.