thr3ads.net - R help - [R] using a regular expression [Sep 2016]


Glenn Schultz

2016-Sep-10 19:23 UTC

[R] using a regular expression

I have a file that basically carries three datasets of differing lengths.
To make this a single downloadable file, the creator of the file has used both
NUL (hex 00) and space (hex 20) to normalize the lengths.

Below is the function that I am writing. I am using sed to replace the hex
characters. First, to get past NUL, I use sed to replace hex 00 with hex 20.
This has worked. Once the NUL is removed, I can successfully parse the file
with readLines and str_sub. The final step before delimiting the file and
making it nice and tidy is to remove the hex 20 characters. I am using the same
strategy to eliminate the spaces, and the sed command works in a shell but does
not work in the R function. What am I doing wrong? I have dput some of the
nastier lines with hex 20 characters below my code.

Any advice is appreciated.

Glenn

arm <- function(filepath){
callpath <- paste(filepath, "arm.txt", sep ="")
ARMReturn <- paste(filepath, "arm.csv", sep = "")
ARMPoolReturnPath <- paste(filepath, "armatpool.csv", sep = "")
ARMNextChgReturnPath <- paste(filepath, "nexratechangedate.csv", sep = "")
ARMFirstPmtReturnPath <- paste(filepath, "firstpaymentdate.csv", sep = "")

# This file contains NUL hex characters before parsing the file replace
# the hex NUL x00 with space x20 and save as a csv file. Use system command
sedcommand <- paste("sed -e 's/\\x00/\\x20/g' <", 
filepath, "arm.txt", 
">", "arm.csv", sep = " ")
system(sedcommand)

# read the arm quartile data to a file once skipNuls then length of each
# record set changes and the data map provided by FNMA is no longer valid
# with respect to the length of each embedded data set
data <- readLines(ARMReturn, encoding = "ascii")

quartile <- NULL
numchar <- nchar(x = data, type = "chars")
start <- c(seq(1, numchar, 399))
end <- c(seq(399, numchar, 399))
quartile <- str_sub(data, start[1:length(start)], end[1:length(end)])
write(quartile, ARMReturn)

# The file has been parsed according to length 400 for each data element.
# The next step is to remove all the trailing white space hex character
# x20

sedcommand2 <- paste("sed -e '/\\x20/d' <", 
filepath, "arm.csv", 
">", "arm2.csv", sep = "")
system(sedcommand2)
} # end of function


c("                                                 555556
WS320021201006125{000378{000348{                                                
",
"                                                  555556
WS320021201006250{000954{000880{                                                
",
"                                                   555556
WS320021201005625{001062{000983{                                                
",
"                                                    555556
WS320030101005250{000027{000025{                                                
",
"                                                     555556
WS320030101006500{000033{000030{                                                
",
"                                                      555556
WS320030101005125{000061{000056{                                                
",
"                                                       555556
WS320030101005375{000095{000088{                                                
",
"                                                        555556
WS320030101005350{000217{000200{                                                
",
"                                                         555556
WS320030101006125{000400{000369{                                                
",
"                                                          555556
WS320030101005310{000439{000406{                                                
",
"                                                           555556
WS320030101006000{000573{000529{                                                
"

David Wolfskill

2016-Sep-11 13:50 UTC

[R] using a regular expression

On Sat, Sep 10, 2016 at 07:23:37PM +0000, Glenn Schultz wrote:
> ...
> Below is the function that I am writing. I am using sed to replace the hex
> characters. First, to get past NUL, I use sed to replace hex 00 with hex 20.
> This has worked. Once the NUL is removed, I can successfully parse the file
> with readLines and str_sub. The final step before delimiting the file and
> making it nice and tidy is to remove the hex 20 characters. I am using the
> same strategy to eliminate the spaces, and the sed command works in a shell
> but does not work in the R function. What am I doing wrong? I have dput some
> of the nastier lines with hex 20 characters below my code.

I believe that you will find that the sed "d" command deletes the
"pattern space" (in a simple text file, it would delete the line) in
which the specified regular expression is found.
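
The difference between deleting matching lines and deleting matching
characters can be sketched in R itself. This is only an analogy to the two
sed commands, not a transcription of them, and the two records below are
made-up stand-ins for the real data:

```r
# Hypothetical records; like the real data, every one contains spaces.
recs <- c("  555556WS3{000378{  ", "   555556WS3{000954{ ")

# Analogue of sed '/ /d': drop each line in which the pattern is found.
deleted <- recs[!grepl(" ", recs, fixed = TRUE)]

# Analogue of sed 's/ //g': keep each line, remove the matching characters.
substituted <- gsub(" ", "", recs, fixed = TRUE)
```

Because every record contains at least one space, the line-deleting form
leaves nothing behind, while the substituting form keeps both records.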

I suspect that you actually want to eliminate the "space" characters
themselves, so rather than:
> ...
> # The file has been parsed according to length 400 for each data element.
> # The next step is to remove all the trailing white space hex character
> # x20
>
> sedcommand2 <- paste("sed -e '/\\x20/d' <",

what is wanted is:

sedcommand2 <- paste("sed -e 's/\\x20//g' <",
> ...

Note that you might consider using R's gsub() function to perform that
"space elimination" both natively and a bit earlier.

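A minimal sketch of the gsub() route, assuming the records are already in a
character vector. The sample string is hypothetical, and whether interior
spaces should survive depends on the data map, which is not shown in the
thread:

```r
# Hypothetical record with leading, interior, and trailing spaces.
rec <- "   555556 WS320021201006125{000378{000348{   "

# Remove every space character, as sed 's/\x20//g' would.
no_spaces <- gsub(" ", "", rec, fixed = TRUE)

# Remove only leading and trailing whitespace, keeping interior spaces.
trimmed <- trimws(rec)
```

gsub() is vectorized, so the same call works unchanged on the whole vector
returned by readLines().
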
Peace,
david
-- 
David H. Wolfskill				r at catwhisker.org
Those who would murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.

Marco Silva

2016-Sep-11 15:54 UTC

[R] using a regular expression

Excerpts from Glenn Schultz's message of 2016-09-10 19:23:37 +0000:
> I have a file that basically carries three datasets of differing
> lengths. To make this a single downloadable file, the creator of the
> file has used both NUL (hex 00) and space (hex 20) to normalize the
> lengths.
>
> Below is the function that I am writing. I am using sed to replace
> the hex characters. First, to get past NUL, I use sed to replace hex
> 00 with hex 20. This has worked. Once the NUL is removed, I can
> successfully parse the file with readLines and str_sub. The final step
> before delimiting the file and making it nice and tidy is to remove
> the hex 20 characters. I am using the same strategy to eliminate the
> spaces, and the sed command works in a shell but does not work in the R
> function. What am I doing wrong? I have dput some of the nastier
> lines with hex 20 characters below my code.
> 
> Any advice is appreciated.
You can use readLines(pipe(sedcommand)) to get the filtered dataset.
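
A rough sketch of that pipe() approach, assuming a sed binary is on the
PATH. The temporary file and its contents are made up for illustration, and
a literal space is used in the pattern because \xHH escapes are not part of
POSIX sed:

```r
# Hypothetical input file standing in for arm.csv.
infile <- tempfile(fileext = ".csv")
writeLines(c("  555556WS3{000378{   ", "   555556WS3{000954{  "), infile)

# Build the command; shQuote() protects the path from the shell.
sedcommand <- paste("sed -e 's/ //g'", shQuote(infile))

# readLines() opens the pipe, reads sed's output, and closes it again.
cleaned <- readLines(pipe(sedcommand))
```

This skips the intermediate output file, since the filtered lines arrive
directly as a character vector.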

I didn't understand what kind of filtering you are doing; it seems
confusing to me. But, as someone pointed out, the 'd' command deletes
the whole pattern space, so if you are trying to substitute, use:

s/pattern//g # effectively removing pattern from the text


Best Luck,

Marco

-- 
Marco Arthur @ (M)arco Creatives

Jeff Newmiller

2016-Sep-12 16:32 UTC

[R] using a regular expression

If you think you might want to put this function into a package, it would be
much better to use gsub instead of passing the job off to an external program,
because non-POSIX operating systems (Windows) will be a headache to support.
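
For what it is worth, a sed-free version might look something like the
sketch below. The toy file, the record width of 10, and the variable names
are all assumptions for illustration; the real file would use the record
length from the FNMA data map:

```r
# Hypothetical stand-in for arm.txt: two 10-byte records, one padded
# with NUL bytes and one padded with spaces.
path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("AB12"), as.raw(rep(0, 6)),
           charToRaw("CD345"), as.raw(rep(32, 5))), path)

# Read the whole file as raw bytes so embedded NULs are preserved.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Replace NUL (0x00) with space (0x20), the job of the first sed call.
bytes[bytes == as.raw(0)] <- as.raw(32)
txt <- rawToChar(bytes)

# Cut into fixed-width records; width 10 is an assumption for this toy input.
width <- 10
starts <- seq(1, nchar(txt), by = width)
records <- substring(txt, starts, starts + width - 1)

# Strip the padding, the job of the second sed call.
records <- trimws(records)
```

Working on raw bytes avoids both the external program and the problem that
R character strings cannot hold embedded NULs.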
-- 
Sent from my phone. Please excuse my brevity.

