thr3ads.net - Mechanize users - [Mechanize-users] Malformed HTML [May 2009]

Chris Miller

2009-May-08 01:01 UTC

[Mechanize-users] Malformed HTML

I'm using Mechanize to parse an extraordinarily malformed HTML page.

After submitting a form like so:
   page = mech.submit(dform)

The result I get back is truncated.  I suspect that it's because the
source HTML looks like this:

<html>
<head> yadda yadda</head>
     <p>some text</p>

    <html>
    <table yadda yadda>


My 'page' variable contains only the data that occurs before the
second <html> tag.

Am I right in suspecting that this is the cause of my problems?  Are
there any work-arounds that will enable me to grab all of the text,
even if it can't be parsed sanely?

Thanks,

Chris Miller
chrisamiller at gmail.com

Aaron Starr

2009-May-08 01:31 UTC

page.body should have the raw, original
text-that-kind-of-reminds-us-of-html, does it not?
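
A minimal sketch of working with the raw text directly, assuming page.body returns the unparsed response as a String, as suggested above. The sample string stands in for the real response, and split_on_html_tags is a hypothetical helper name, not part of Mechanize.

```ruby
# Stand-in for the raw response; in real use this would be something like
#   raw = page.body
# after page = mech.submit(dform).
raw = <<~HTML
  <html>
  <head> yadda yadda</head>
  <p>some text</p>
  <html>
  <table yadda yadda>
HTML

# Hypothetical helper: split the raw text at every opening <html> tag
# (case-insensitive) and keep the non-empty pieces, so content after a
# stray second <html> tag is still reachable without an HTML parser.
def split_on_html_tags(raw)
  raw.split(/<html\b[^>]*>/i).map(&:strip).reject(&:empty?)
end

chunks = split_on_html_tags(raw)
chunks.each_with_index do |chunk, i|
  puts "chunk #{i}: #{chunk.bytesize} bytes"
end
```

Each chunk can then be searched with plain string or regular-expression methods, which avoids depending on how the parser recovers from the duplicate tag.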



Chris Miller

2009-May-08 16:03 UTC

That solves my problem - I somehow missed that in the docs.  Thanks
for your help.

-Chris


That helps somewhat. Now, when I parse a local copy of the HTML page,
I get everything by using page.body.

The problem is that when I retrieve the data from the server,
page.body cuts off at the original point, right before
the second <html> tag.

The html file in question can be found here:
http://chrisamiller.com/temp.html

Hrmm... maybe some of the page is being written out by JavaScript.  If
that's the case, Mechanize won't be able to deal with it, right?


-Chris
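
One way to narrow this down is to check whether the server actually sent more bytes than ended up in the body. The sketch below is only a diagnostic idea, not something confirmed in this thread: body_truncated? is a hypothetical helper, and reading the header through page.response is an assumption about the Mechanize version in use.

```ruby
# Hypothetical helper: compare the Content-Length the server declared with
# the number of bytes actually received.
# Returns true if the body is shorter than declared, false if it is not,
# and nil when no Content-Length was sent (for example, chunked responses).
def body_truncated?(content_length, body)
  return nil if content_length.nil? || content_length.to_s.strip.empty?

  body.bytesize < Integer(content_length.to_s.strip, 10)
end

# Assumed usage with a fetched page (header access may differ by version):
#   declared = page.response['content-length']
#   puts body_truncated?(declared, page.body).inspect
puts body_truncated?("100", "short body").inspect
puts body_truncated?(nil, "short body").inspect
```

If the response was compressed in transit, Content-Length describes the compressed size, so a mismatch alone is not proof of truncation. A result of false or nil would point toward the missing part never being in the response at all, which fits the JavaScript theory.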




