[Release] Parsedom and other functions

newatv2user Offline
Fan
Posts: 301
Joined: May 2011
Reputation: 27
Post: #71
The URL I'm trying to parse is:
http://topdocumentaryfilms.com/all/

My code is this:
Quote:
itemsDOM = common.parseDOM(contents, "div", attrs = { "class": "wrapexcerpt"}, ret=False)

I swear it was working a couple of days ago. It's not working anymore. I tried your suggestion with replace, but still no go. Any hint on how I could fix this would be great. Thanks.
TobiasTheCommie Offline
Skilled Python Coder
Posts: 621
Joined: Apr 2008
Reputation: 3
Post: #72
OK, I've downloaded the page and added two integration tests (so far) against it, and both fail.

I'll see what I can figure out (both a fix and a workaround).

ETA: .replace("\n", " ") should do the trick. I'm doing that inside parseDOM in the next version, but you can do it on your input beforehand so you don't have to wait.
(This post was last modified: 2012-01-31 02:31 by TobiasTheCommie.)
newatv2user Offline
Fan
Posts: 301
Joined: May 2011
Reputation: 27
Post: #73
Awesome. That worked. I had tried replace("\n", "") before, which didn't work. Replacing with a space did the trick.

Thanks a lot.
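
For anyone wondering why the space matters, here is a minimal sketch. It assumes the page has a line break between a tag name and its attribute (the thread does not say where the newlines actually sit), and it uses a simple stand-in regular expression rather than parseDOM's own matching.

```python
import re

# Hypothetical snippet: a line break separates the tag name from its attribute.
html = '<div\nclass="wrapexcerpt">Some text</div>'

joined = html.replace("\n", "")   # tag name and attribute get glued together
spaced = html.replace("\n", " ")  # tag name and attribute stay separated

# Stand-in pattern for "a div with class wrapexcerpt"; this is NOT parseDOM's regex.
pattern = re.compile(r'<div [^>]*class="wrapexcerpt"[^>]*>')

found_raw = bool(pattern.search(html))
found_joined = bool(pattern.search(joined))
found_spaced = bool(pattern.search(spaced))

print(joined)
print(spaced)
print(found_raw, found_joined, found_spaced)
```

With an empty replacement the tag turns into `<divclass=...>`, which no longer looks like a div at all. With a space it stays a normal opening tag.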
newatv2user Offline
Fan
Posts: 301
Joined: May 2011
Reputation: 27
Post: #74
More problems on the same page.

Part of debug log:
Quote:21:05:21 T:1036 NOTICE: <ul style="background:#efefef;"><li style="padding:5px;font-size:13px;"><strong>Recommended Documentaries</strong></li></ul><ul><li><a href="http://topdocumentaryfilms.com/planet-earth-the-complete-bbc-series/">Planet Earth: The Complete BBC Series</a></li><li><a href="http://topdocumentaryfilms.com/cosmos/">Cosmos: A Personal Voyage (Carl Sagan)</a></li><li><a href="http://topdocumentaryfilms.com/philosophy-guide-to-happiness/">Philosophy – Guide to Happiness</a></li><li><a href="http://topdocumentaryfilms.com/through-the-wormhole/">Through the Wormhole</a></li><li><a href="http://topdocumentaryfilms.com/the-lost-world-of-lake-vostok/">The Lost World of Lake Vostok</a></li><li><a href="http://topdocumentaryfilms.com/story-of-science/">The Story of Science: Power, Proof and Passion</a></li><li><a href="http://topdocumentaryfilms.com/james-burke-connections/">James Burke: Connections</a></li><li><a href="http://topdocumentaryfilms.com/genius-charles-darwin/">The Genius of Charles Darwin</a></li><li>Universe: <a href="http://topdocumentaryfilms.com/universe-season-1/">Season 1</a>, <a href="http://topdocumentaryfilms.com/universe-season-2/">Season 2</a>, <a href="http://topdocumentaryfilms.com/universe-season-3/">Season 3</a>, <a href="http://topdocumentaryfilms.com/universe-season-4/">Season 4</a>, <a href="http://topdocumentaryfilms.com/universe-season-5/">Season 5</a></li><li><a href="http://topdocumentaryfilms.com/why-i-am-no-longer-a-christian/">Why I Am No Longer a Christian</a></li></ul>
21:05:21 T:1036 NOTICE: [TopDoc - 0.0.1] parseDOM : 'start: 'li' - {} - False - <type 'str'>'
21:05:21 T:1036 NOTICE: [TopDoc - 0.0.1] parseDOM : 'no list found, making one on just the element name'
21:05:21 T:1036 NOTICE: [TopDoc - 0.0.1] parseDOM : 'Getting element content for 1 matches '

This is the code I'm using:
Quote:
print item
recDOM2 = common.parseDOM(item, "li")

It used to find every "li" element; now it doesn't. Am I doing something wrong here?
newatv2user Offline
Fan
Posts: 301
Joined: May 2011
Reputation: 27
Post: #75
Hi Tobi

I am still having some problems with parsedom.

Quote:<div class="post-right"><h3><a href="http://documentarystorm.com/last-chance-to-see/" rel="bookmark" title="Stream this documentary: Last Chance to See">Last Chance to See</a></h3><p class="post-meta">Jan 29th, 2012 // <a href="http://documentarystorm.com/category/nature-biology/animals-nature-biology/" title="View all posts in Animals" rel="category tag">Animals</a>, <a href="http://documentarystorm.com/category/nature-biology/" title="View all posts in Nature" rel="category tag">Nature</a> // <a href="http://documentarystorm.com/last-chance-to-see/#comments" title="Comment on Last Chance to See">2 Comments »</a></p><p>Stephen Fry and zoologist Mark Carwardine head to the ends of the earth in search of animals on the edge of extinction.</p><div style="display: none">VN:RO [1.9.13_1145]</div><div class="ratingblock "><div class="ratingheader "></div><div class="ratingstarsinline "><div id="article_rater_7827" class="ratepost gdsr-pumpkin gdsr-size-20"><div class="starsbar gdsr-size-20"><div class="gdouter gdheight"><div id="gdr_vote_a7827" style="width: 118.181818182px;" class="gdinner gdheight"></div></div></div></div></div></div></div>

In the HTML above, how do I get the second <p>, the one that contains the description? parseDOM only returns the first <p>, the one with a class attribute.

Thanks.
TobiasTheCommie Offline
Skilled Python Coder
Posts: 621
Joined: Apr 2008
Reputation: 3
Post: #76
If you do
result = parseDOM(data, "p")

then the second p should be in result[1].
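
To illustrate the expected shape of that result (one list entry per <p>, so the description sits at index 1), here is a rough stand-in built on Python 3's standard `html.parser`. It is not parseDOM and does not reproduce its behaviour; `ParagraphCollector` is a hypothetical name, and the sample is a shortened version of the HTML posted above.

```python
from html.parser import HTMLParser  # Python 3 standard library


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element (stand-in, not parseDOM)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []
        self._inside_p = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self._inside_p -= 1

    def handle_data(self, data):
        # Only keep text that appears while a <p> is open.
        if self._inside_p:
            self.paragraphs[-1] += data


html = (
    '<p class="post-meta">Jan 29th, 2012 // <a href="#">Animals</a></p>'
    '<p>Stephen Fry and zoologist Mark Carwardine head to the ends of the '
    'earth in search of animals on the edge of extinction.</p>'
)

collector = ParagraphCollector()
collector.feed(html)
collector.close()
result = collector.paragraphs

print(len(result))
print(result[1])
```

Comparing the length of a list like this with what parseDOM returns is a quick way to tell whether the parser is dropping elements.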
newatv2user Offline
Fan
Posts: 301
Joined: May 2011
Reputation: 27
Post: #77
That is exactly what I think I am doing.

Quote:
print item
Plot = common.parseDOM(item, "p")
print 'ParseDOM returned: ' + str(len(Plot))

But I am not getting the desired result:
Quote:08:36:34 T:828 NOTICE: <div class="post-left"><a href="http://documentarystorm.com/last-chance-to-see/" title="Last Chance to See"><img src="http://documentarystorm.com/files/2012/01/last-chance-to-see1.jpg" alt="Last Chance to See (documentary)" height="150" width="150" /></a></div><div class="post-right"><h3><a href="http://documentarystorm.com/last-chance-to-see/" rel="bookmark" title="Stream this documentary: Last Chance to See">Last Chance to See</a></h3><p class="post-meta">Jan 29th, 2012 // <a href="http://documentarystorm.com/category/nature-biology/animals-nature-biology/" title="View all posts in Animals" rel="category tag">Animals</a>, <a href="http://documentarystorm.com/category/nature-biology/" title="View all posts in Nature" rel="category tag">Nature</a> // <a href="http://documentarystorm.com/last-chance-to-see/#comments" title="Comment on Last Chance to See">2 Comments »</a></p><p>Stephen Fry and zoologist Mark Carwardine head to the ends of the earth in search of animals on the edge of extinction.</p><div class="gdsrcacheloader gdsrclsmall" id="gdsrc_asr.7827.0.1.1327816953.48.1.20.6.4.0"><strong>GD Star Rating</strong><br /><em>a WordPress rating system</em></div></div><div class="clearfix"></div>
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'start: 'p' - {} - False - <type 'str'>'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'no list found, making one on just the element name'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'Getting element content for 1 matches '
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] _getDOMContent : 'match: <p class="post-meta">'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] _getDOMContent : 'start: 441, len: 21, end: 887'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] _getDOMContent : 'done html length: 425'
08:36:34 T:828 NOTICE: [DocumentaryStorm - 0.0.1] parseDOM : 'Done'
08:36:34 T:828 NOTICE: ParseDOM returned: 1

I think my problem in post #74 is similar: when elements with and without attributes are mixed (<li> there, <p> here), only part of them gets matched.
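
One way to check that suspicion is to count the opening tags independently of parseDOM. The sketch below is only a diagnostic aid: `count_open_tags` is a hypothetical helper, the regular expressions are deliberately simple, and the sample is a shortened version of the markup from post #74.

```python
import re


def count_open_tags(html, name):
    """Return (with_attrs, bare) counts of opening <name> tags in html."""
    escaped = re.escape(name)
    # Opening tags without attributes, e.g. <li>
    bare = len(re.findall(r"<%s\s*>" % escaped, html))
    # Opening tags followed by at least one attribute, e.g. <li style="...">
    with_attrs = len(re.findall(r"<%s\s+[^>]+>" % escaped, html))
    return with_attrs, bare


sample = (
    '<ul style="background:#efefef;">'
    '<li style="padding:5px;font-size:13px;"><strong>Recommended Documentaries</strong></li>'
    '</ul><ul>'
    '<li><a href="http://topdocumentaryfilms.com/cosmos/">Cosmos: A Personal Voyage (Carl Sagan)</a></li>'
    '<li><a href="http://topdocumentaryfilms.com/through-the-wormhole/">Through the Wormhole</a></li>'
    '</ul>'
)

li_counts = count_open_tags(sample, "li")
print(li_counts)
```

If the two counts add up to more than the length of the list parseDOM returns, the input really does contain elements that are being skipped.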

Or maybe I have a corrupted copy of parsedom. How do I check or reinstall?

Thanks.
TobiasTheCommie Offline
Skilled Python Coder
Posts: 621
Joined: Apr 2008
Reputation: 3
Post: #78
Confirmed and fixed in trunk.

Until that is released, the workaround is to call .replace("<p>", "<p />") on the input before passing it to parseDOM.
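
A minimal sketch of that preprocessing step, combined with the newline workaround from post #72. `preprocess_for_parsedom` is a hypothetical helper name, and the parseDOM call itself is left out because it needs the add-on module. What parseDOM returns afterwards depends on the version you have installed.

```python
def preprocess_for_parsedom(html):
    """Apply the two input workarounds suggested in this thread (hypothetical helper)."""
    html = html.replace("\n", " ")       # workaround from post #72
    html = html.replace("<p>", "<p />")  # workaround from post #78
    return html


sample = (
    '<p class="post-meta">Jan 29th, 2012</p>\n'
    '<p>Stephen Fry and zoologist Mark Carwardine.</p>'
)

cleaned = preprocess_for_parsedom(sample)
print(cleaned)
```

The cleaned string is then what you would hand to parseDOM in place of the raw page content.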
newatv2user Offline
Fan
Posts: 301
Joined: May 2011
Reputation: 27
Post: #79
More problems.

Portion of HTML I'm using: http://pastebin.com/C1imeTMG

My code:
Quote:
suckerfishDOM = common.parseDOM(contents, "ul", attrs = { "id": "suckerfishnav"})[0]
catDOM = common.parseDOM(suckerfishDOM, "li", attrs = { "class": "cat-item cat-item-[0-9]{1,}"})
print 'Debug Info - catDOM length: ' + str(len(catDOM))
for dCat in catDOM:
    print 'looping through catDOM'
    print 'Debug Info: ' + dCat
    if dCat is None or dCat == '':
        continue

Resulting portion of log: http://pastebin.com/zAbk89n9

In summary, only the first match in catDOM is non-empty; all the rest are empty strings. Am I doing this correctly?

Thanks.
TobiasTheCommie Offline
Skilled Python Coder
Posts: 621
Joined: Apr 2008
Reputation: 3
Post: #80
Hm, annoying.

I will write an integration test for this and get back to you.

ETA:
This works for me (in trunk code, but I hope it works for you as well).
Code:
# Grab the navigation list by id.
ret = common.parseDOM(self.readTestInput("documentarystorm2.html", False), "ul", attrs = { "id": "suckerfishnav"})
print repr(ret)

# Narrow down to the nested "children" lists first.
ret2 = common.parseDOM(ret, "ul", attrs = { "class": "children"})
print "2: " + repr(ret2[0])

# Then pull the category items out of each child list separately.
for child in ret2:
    ret3 = common.parseDOM(child, "li", attrs = { "class": "cat-item cat-item-[0-9]{1,}"})
    print "3: " + repr(ret3)
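
As a side note on the class pattern used above: assuming the attrs value is treated as a regular expression (the pattern suggests it, but the thread does not spell out how parseDOM anchors it), it can be tried on its own with Python's `re` module. The class values below are made up for illustration.

```python
import re

# Pattern copied from the attrs value in the posts above.
class_pattern = re.compile(r"cat-item cat-item-[0-9]{1,}")

# Hypothetical class attribute values, for illustration only.
candidates = [
    "cat-item cat-item-7",
    "cat-item cat-item-123",
    "cat-item",
    "cat-item cat-item-",
]

# re.match anchors at the start of the string only.
matches = [value for value in candidates if class_pattern.match(value)]
print(matches)
```

`{1,}` means "one or more digits", the same as `+`, so a class without a numeric suffix is not matched.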
(This post was last modified: 2012-02-05 13:07 by TobiasTheCommie.)