thr3ads.net - CentOS - [CentOS] perl code to remove newlines [Dec 2010]

ken

2010-Dec-30 13:19 UTC

[CentOS] perl code to remove newlines

Given an HTML file which looks like this:

--------- begin snippet ---------
<HTML
><HEAD
><TITLE
>We've Lied to You&#8230;</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REL="HOME"
TITLE="Maximum RPM"
HREF="index.html"><LINK
REL="UP"
TITLE="Using RPM to Verify Installed Packages"
HREF="ch-rpm-verify.html"><LINK
...
--------- end snippet ---------

I'm coding some Perl to make it look something like this:

--------- begin snippet ---------
<html>
<head>
<title>We've Lied to You&#8230;</title>

<meta name="generator" content="Modular DocBook HTML Stylesheet Version
1.79">

<link rel="HOME" title="Maximum RPM" href="index.html">

<link rel="UP" title="Using RPM to Verify Installed Packages"
href="ch-rpm-verify.html">

<link ....
--------- end snippet ---------

I've hit a wall trying to remove all the newlines.  I've tried it
several ways... here's just one:

--------- begin snippet ---------
while (<$in>)
{
    s/<(\w*\W)/<\L$1/g;		# Downcase XXX in "<XXX".
    s/<\/(\w*\W)/<\/\L$1/g;	# Downcase XXX in "</XXX".
    if(/^>/)			# if this line starts with '>'
    {				# then
	$curr = tell $in;	# Note current file position,
	seek $in, $prev, 0;	# go back to previous line,
	chomp;			# remove its trailing newline char,
	seek $in, $curr, 0;	# and reset position to current line.
    }
    else
    {
	$curr = tell $in;	# Note current file position,
	seek $in, $prev, 0;	# go back to previous line
	s/\n/ /; 		# Append a space,
	chop;			# and then chomp.
	seek $in, $curr, 0;	# and reset position to current line.
    }
    print;
    print $out;
    $prev = tell $in;		# Location of previous line.
}
--------- end snippet ---------

When I cat the output file, it looks like this:

--------- begin snippet ---------
GLOB(0x9fd587c)<htmlGLOB(0x9fd587c)><headGLOB(0x9fd587c)><titleGLOB(0x9fd587c)>We've
Lied to
You&#8230;</titleGLOB(0x9fd587c)><metaGLOB(0x9fd587c)NAME="GENERATOR"GLOB(0x9fd587c)CONTENT="Modular
DocBook HTML Stylesheet Version
1.79"><linkGLOB(0x9fd587c)REL="HOME"GLOB(0x9fd587c)TITLE="Maximum
RPM"GLOB(0x9fd587c)HREF="index.html"><linkGLOB(0x9fd587c)REL="UP"GLOB(0x9fd587c)TITLE="Using
RPM to Verify Installed
Packages"GLOB(0x9fd587c)HREF="ch-rpm-verify.html"><linkGLOB(0x9fd587c)....
--------- end snippet ---------

I should say that the output *is* all on one line, not line-wrapped the
way you see it above.  I have a hunch as to why there are the
"GLOB(0x9fd587c)" thingies everywhere the newlines or spaces (' ')
should be.  If some expert here could explain them, that would be really
good.  More important, though, would be some instruction on how to
remove the newlines without creating all the GLOB(...) garbage.  Might I
have to rewrite the script to open the file in binary mode... or what?


Maximum thanks for your assistance.

Bowie Bailey

2010-Dec-30 14:18 UTC

On 12/30/2010 8:19 AM, ken wrote:
> [...]
> More importantly though would be some instruction as to how to
> remove the newlines without creating all the GLOB(...) garbage.  Might I
> have to rewrite the script so to open the file in binary mode... or what?

So you are trying to remove all of the newlines inside the tags?

I would approach it from the other direction.  Remove ALL of the
newlines and then add back the ones you want.

Something like this (untested):

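$irs = $/;      # Save the input record separator
$/ = undef;     # Undefine it so the next read returns the whole file
$html = <$in>;
$/ = $irs;      # Restore the saved separator

$html =~ s/\n/ /g;          # Replace all newlines with spaces
$html =~ s/(<\w+)/\n$1/g;   # Add a newline before all begin tags
print $html . "\n";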

This pulls in the whole file before it starts processing, but as long as
it is not ridiculously huge, this should not be a problem.
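
The remove-everything-then-add-back idea can be sketched as a self-contained script. This is only a sketch under assumptions: the sample string and the in-memory filehandle are hypothetical stand-ins for the real file, the regexes are not a real HTML parser, and the extra steps (dropping the newline in front of '>' and downcasing tag names) are taken from the original question rather than from the code above.

```perl
use strict;
use warnings;

# Hypothetical sample input, standing in for the real HTML file.
my $sample = qq{<HTML\n><HEAD\n><TITLE\n>Example</TITLE\n><META\nNAME="GENERATOR"\nCONTENT="x"><LINK\nREL="HOME"\nHREF="index.html">\n};

# Read from the string through an in-memory filehandle.
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";

my $html = do {
    local $/;    # slurp mode; the old value of $/ returns when the block ends
    <$in>;
};
close $in;

$html =~ s/\s+\z//;              # drop trailing whitespace, including the last newline
$html =~ s/\n>/>/g;              # a newline directly before '>' is simply removed
$html =~ s/\n/ /g;               # every other newline becomes a space
$html =~ s/<(\/?\w+)/<\L$1/g;    # downcase tag names, as in the original script
$html =~ s/(<\w)/\n$1/g;         # newline before every opening tag
$html =~ s/\A\n//;               # but not before the very first one

print "$html\n";
```

Using local inside a do block is an alternative to saving and restoring $/ by hand; both leave the separator unchanged for the rest of the program.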

-- 
Bowie

Sean

2010-Dec-30 19:20 UTC

Not sure exactly what you are trying to do, but Tie::File might be worth
a look if you haven't tried it already.
Sean
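
For readers who have not used it: Tie::File is a core Perl module that presents the lines of a file as an array, with the line endings already removed from each element by default. A minimal sketch of how it might apply here, assuming a hypothetical temporary file in place of the real page and a simple joining rule (a line starting with '>' attaches directly, any other line gets a leading space):

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical sample file, standing in for the real HTML page.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} qq{<HTML\n><HEAD\n><TITLE\n>Example</TITLE\n><META\nNAME="GENERATOR"\nCONTENT="x">\n};
close $tmp or die "Cannot close $path: $!";

# Each element of @lines is one line of the file, without its newline.
tie my @lines, 'Tie::File', $path or die "Cannot tie $path: $!";

# Glue the lines together: a line starting with '>' attaches directly,
# any other line is separated from the previous text by a space.
my $joined = '';
for my $line (@lines) {
    $joined .= ' ' if length($joined) && $line !~ /^>/;
    $joined .= $line;
}
untie @lines;

print "$joined\n";
```

This sketch only reads from the tied array. Assigning to its elements would rewrite the file on disk, which may or may not be what is wanted.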


Jerry McAllister

2010-Dec-30 22:41 UTC

It isn't perl, but does 'tr' exist in CentOS (it does in FreeBSD)?
It would do it.

////jerry
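
For what it's worth, tr is part of GNU coreutils, so it is available on CentOS, and Perl has a built-in transliteration operator of the same name. A minimal sketch of the Perl operator, assuming a hypothetical sample string in place of the file contents. Note that it turns every newline into a space, including the ones that sit directly in front of a '>':

```perl
use strict;
use warnings;

# Hypothetical sample text, standing in for the contents of the HTML file.
my $text = qq{<LINK\nREL="HOME"\nTITLE="Maximum RPM"\nHREF="index.html">\n};

# tr/// maps characters one-for-one; here every newline becomes a space.
# It returns the number of characters it translated.
my $count = ($text =~ tr/\n/ /);

print "$text\n";
print "newlines replaced: $count\n";
```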


Bart Schaefer

2010-Dec-31 03:52 UTC

On Thu, Dec 30, 2010 at 5:19 AM, ken <gebser at mousecar.com> wrote:
>
> --------- begin snippet ---------
> while (<$in>)
> {
>     s/<(\w*\W)/<\L$1/g;         # Downcase XXX in "<XXX".
>     s/<\/(\w*\W)/<\/\L$1/g;     # Downcase XXX in "</XXX".
chomp;  # Always remove the newline
unless (/<html/) {
   # Not on first line, so

>     if(/^>/)                    # if this line starts with '>'
>     {                           # then
>         $curr = tell $in;       # Note current file position,
>         seek $in, $prev, 0;     # go back to previous line,
>         chomp;                  # remove its trailing newline char,
>         seek $in, $curr, 0;     # and reset position to current line.
>     }
>     else
>     {
>         $curr = tell $in;       # Note current file position,
>         seek $in, $prev, 0;     # go back to previous line
>         s/\n/ /;                # Append a space,
>         chop;                   # and then chomp.
>         seek $in, $curr, 0;     # and reset position to current line.
>     }
>     print;
>     print $out;
>     $prev = tell $in;           # Location of previous line.
> }
> --------- end snippet ---------
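
Two things about the quoted loop are worth spelling out. First, the GLOB(0x9fd587c) strings come from "print $out;": with nothing after the handle, Perl treats $out as the value to print, so the stringified filehandle goes to the same place as the plain "print;" before it. Second, seek and tell only move the read position; chomp, chop and s/// still act on $_, which holds the current line, never the previous one. A minimal line-by-line sketch along the "always remove the newline" idea, assuming hypothetical in-memory handles in place of the real files:

```perl
use strict;
use warnings;

# Hypothetical sample input and in-memory output, standing in for real files.
my $sample = qq{<HTML\n><HEAD\n><TITLE\n>Example</TITLE\n><LINK\nREL="HOME"\nHREF="index.html">\n};
my $result = '';
open my $in,  '<', \$sample or die "Cannot open input: $!";
open my $out, '>', \$result or die "Cannot open output: $!";

my $first = 1;
while (my $line = <$in>) {
    chomp $line;                         # always remove the newline
    $line =~ s/<(\/?\w+)/<\L$1/g;        # downcase tag names
    $line =~ s/^>(?=<\w)/>\n/;           # start each following opening tag on a new line

    # Nothing goes before the first line or a line starting with '>';
    # any other line is separated from the previous one by a space.
    my $sep = ($first || $line =~ /^>/) ? '' : ' ';
    $first = 0;

    print {$out} $sep, $line;            # handle in braces, then the data to print
}
print {$out} "\n";
close $out or die "Cannot close output: $!";
close $in;

print $result;
```

Writing the handle as {$out} followed by a list makes it unambiguous which part is the filehandle and which part is the data.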
