thr3ads.net - R help - [R] Parsing question, partly comma separated partly underscore separated string [Mar 2011]


Eric Fail

2011-Mar-07 03:13 UTC

[R] Parsing question, partly comma separated partly underscore separated string

Dear R-list,

I have a partly comma separated partly underscore separated string that I am
trying to parse into R.

Furthermore I have a bunch of them, and they are quite long. I have now spent
most of my Sunday trying to figure this out and thought I would try the list to
see if someone here would be able to get me started.

My data structure looks like this,

(in a example.txt file)
Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960  pixels,
On Device M, M, 3.2.4,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,
ZZ_032_288_436_CON@9z.svg,332,878,3.66,ZZ_384_204_433_TRT@9z.svg,334,824,3.28,
ZZ_365_575_683_TRT@9z.svg,598,878,3.50,ZZ_005_480_239_CON@9z.svg,630,856,8.03,
ZZ_030_423_394_CON@9z.svg,98,846,4.09,ZZ_033_596_398_CON@9z.svg,636,902,3.28,
ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,
ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,
ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,
ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,
ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,
ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,
ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,
ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,
ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,
(end of example.txt file)

The above is approximately 5% of the length of a full file, and I have about
100 of them. Please note that the strings end with a comma.

I am trying to parse it into something like this 

ID         ImgNam BLOCK RUN  Tx  Ty Treatment   x   y    Y
Subject ID    373     1   1 462 488 TRT       592 820 3.35
Subject ID     32     1   2 288 436 CON       332 878 3.66
Subject ID    384     1   3 204 433 TRT       334 824 3.28
Subject ID    365     1   4 575 683 TRT       598 878 3.50
Subject ID      5     1   5 480 239 CON       630 856 8.03
Subject ID     30     1   6 423 394 CON        98 846 4.09
Subject ID     33     1   7 596 398 CON       636 902 3.28
Subject ID    263     1   8  64 320 TRT       570 894 1.26
Subject ID      4     2   1  88 403 CON       606 908 3.32
Subject ID    703     2   2 546 434 CON       624 934 2.58
Subject ID    712     2   3 348 543 CON        20 828 5.36
Subject ID      5     2   4  48 239 CON       580 830 4.36
Subject ID    310     2   5 444 623 TRT       586 806 0.08
Subject ID     30     2   6 423 394 CON       350 854 3.84
Subject ID    340     2   7 382 539 TRT       570 894 1.26
Subject ID    345     3   1 230 662 TRT       632 844 2.47
Subject ID      6     3   2 335 309 CON        96 930 3.63
Subject ID    782     3   3 346 746 TRT       306 850 2.58
Subject ID    334     3   4 200 333 TRT       304 842 3.34
Subject ID    383     3   5 506 726 TRT       622 884 3.84
Subject ID    294     3   6 360 448 TRT        90 858 3.56
Subject ID    334     3   7 335 473 TRT       570 894 1.26

I could do it in Excel, but it would take me a week, and it would be stupid.
If someone could please help me get started, I would very much appreciate it.
It would not only benefit me; my colleagues would also see the benefit of R
and of the R-list in particular.

Thanks in advance!

Eric

--

Don McKenzie

2011-Mar-07 04:39 UTC

In a good text editor it would be one command per file.  So if you
are on UNIX or Mac OS X you could loop through the files with (probably)
an awk command.  I don't remember the syntax (it's been too long), but it
should be just a few lines of shell script.  On Windows I'm not sure,
but there should be something similar.

Maybe that "gets you started".  Probably one of the list jocks will
have it nailed if you wait.
Why does the universe go to all the bother of existing?
-- Stephen Hawking

#define QUESTION ((bb) || !(bb))
-- William Shakespeare



Don McKenzie, Research Ecologist
Pacific WIldland Fire Sciences Lab
US Forest Service

Affiliate Professor
School of Forest Resources, College of the Environment
CSES Climate Impacts Group
University of Washington

desk: 206-732-7824
cell: 206-321-5966
dmck at uw.edu
donaldmckenzie at fs.fed.us

Dennis Murphy

2011-Mar-07 05:55 UTC


Hi:

This should get you most of the way there; I'll let you figure out how to
assign the BLOCK and RUN numbers.

# The data are one long line; paste0() joins the pieces without inserting
# line breaks into the string.
tx <- paste0(
  "Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960  pixels,",
  " On Device M, M, 3.2.4,",
  "ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,",
  "ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,",
  "ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,",
  "ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,",
  "BLOCK@9z.svg,322,842,32.96,",
  "ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,",
  "ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,",
  "ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,",
  "ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,",
  "ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,",
  "ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,",
  "ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,",
  "ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,"
)

# Begin by splitting up the text string tx by 'ZZ_'; this produces a list.
# Then use lapply to split again by @9z.svg and remove the first element
# (basically the first row of tx above).
lst <- strsplit(tx, 'ZZ_')
lst2 <- lapply(lst[[1]], strsplit, '@9z.svg')
lst2[[1]] <- NULL

# This is a function that breaks up the two strings, one separated by '_',
# the other by ','.
# If a third string exists in a list component, it is ignored.
stringBreak <- function(svec) {
  svec <- unlist(svec)
  u <- svec[1]
  v <- svec[2]

  us <- unlist(strsplit(u, '_'))
  # since this string starts with a comma, remove the first empty string
  vs <- unlist(strsplit(v, ','))[-1]
  # a fourth element ('BLOCK') marks the last row of a block
  endblock <- if (length(vs) == 4) 1 else 0
  # write elements to a one-line data frame
  data.frame(ImgNam = as.numeric(us[1]),
    Tx = as.numeric(us[2]),
    Ty = as.numeric(us[3]),
    Treatment = us[4],
    x = as.numeric(vs[1]),
    y = as.numeric(vs[2]),
    Y = as.numeric(vs[3]),
    endblock = endblock)
}

# Slurp into a data frame:

# Method 1: package plyr
library(plyr)
df0 <- ldply(lst2, stringBreak)

# Method 2: do.call()
df0 <- do.call(rbind, lapply(lst2, stringBreak))

Result:
> ldply(lst2, stringBreak)
   ImgNam  Tx  Ty Treatment   x   y    Y endblock
1     373 462 488       TRT 592 820 3.35        0
2      32 288 436       CON 332 878 3.66        0
3     384 204 433       TRT 334 824 3.28        0
4     365 575 683       TRT 598 878 3.50        0
5       5 480 239       CON 630 856 8.03        0
6      30 423 394       CON  98 846 4.09        0
7      33 596 398       CON 636 902 3.28        0
8     263  64 320       TRT 570 894 1.26        1
9       4  88 403       CON 606 908 3.32        0
10    703 546 434       CON 624 934 2.58        0
11    712 348 543       CON  20 828 5.36        0
12      5  48 239       CON 580 830 4.36        0
13    310 444 623       TRT 586 806 0.08        0
14     30 423 394       CON 350 854 3.84        0
15    340 382 539       TRT 570 894 1.26        1
16    345 230 662       TRT 632 844 2.47        0
17      6 335 309       CON  96 930 3.63        0
18    782 346 746       TRT 306 850 2.58        0
19    334 200 333       TRT 304 842 3.34        0
20    383 506 726       TRT 622 884 3.84        0
21    294 360 448       TRT  90 858 3.56        0
22    334 335 473       TRT 570 894 1.26        1
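
One possible way to finish the remaining step (assigning BLOCK and RUN) is sketched below. It assumes only that the data frame has a 0/1 column marking the last row of each block, as in the result above; the small data frame here is made up for illustration and is not the real data.

```r
# Made-up stand-in for a parsed data frame: 1 marks the last row of a block.
df0 <- data.frame(Y = c(3.35, 3.66, 1.26, 3.32, 2.58, 1.26, 2.47),
                  endblock = c(0, 0, 1, 0, 0, 1, 0))

# A row is in block k when k - 1 end-of-block flags occur on earlier rows,
# so shift the flags down by one row before taking the running sum.
df0$BLOCK <- cumsum(c(0, head(df0$endblock, -1))) + 1

# RUN restarts at 1 within each block.
df0$RUN <- ave(df0$BLOCK, df0$BLOCK, FUN = seq_along)

# The flag column is no longer needed.
df0$endblock <- NULL
df0
```

The shift matters because the flagged row itself still belongs to the block it closes; only the rows after it start a new block.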

HTH,
Dennis



Gabor Grothendieck

2011-Mar-07 13:18 UTC

Try this.  We split the line by ZZ_ giving s and remove the junk after
the word BLOCK giving s2.  Then we remove @9z.svg giving s3 and
convert each _ to , giving s4.  We then read it into a data frame
using comma as the separator, calculate the block and run columns,
remove one junk column and assign column names.
> Line <- paste0(
+   "Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960  pixels,",
+   " On Device M, M, 3.2.4,",
+   "ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,",
+   "ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,",
+   "ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,",
+   "ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,",
+   "BLOCK@9z.svg,322,842,32.96,",
+   "ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,",
+   "ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,",
+   "ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,",
+   "ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,",
+   "ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,",
+   "ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,",
+   "ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,",
+   "ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,"
+ )
> s <- strsplit(Line, "ZZ_")[[1]]
> s2 <- sub("BLOCK.*", "BLOCK", s)
> s3 <- sub("@9z.svg", "", s2)
> s4 <- gsub("_", ",", s3)
> DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
> DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
> DF$run <- ave(DF$block, DF$block, FUN = seq_along)
> DF$V8 <- NULL
> names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y",
+                "BLOCK", "RUN")
> DF
   ImgNam  Tx  Ty Treatment   x   y    Y BLOCK RUN
1     373 462 488       TRT 592 820 3.35     1   1
2      32 288 436       CON 332 878 3.66     1   2
3     384 204 433       TRT 334 824 3.28     1   3
4     365 575 683       TRT 598 878 3.50     1   4
5       5 480 239       CON 630 856 8.03     1   5
6      30 423 394       CON  98 846 4.09     1   6
7      33 596 398       CON 636 902 3.28     1   7
8     263  64 320       TRT 570 894 1.26     1   8
9       4  88 403       CON 606 908 3.32     2   1
10    703 546 434       CON 624 934 2.58     2   2
11    712 348 543       CON  20 828 5.36     2   3
12      5  48 239       CON 580 830 4.36     2   4
13    310 444 623       TRT 586 806 0.08     2   5
14     30 423 394       CON 350 854 3.84     2   6
15    340 382 539       TRT 570 894 1.26     2   7
16    345 230 662       TRT 632 844 2.47     3   1
17      6 335 309       CON  96 930 3.63     3   2
18    782 346 746       TRT 306 850 2.58     3   3
19    334 200 333       TRT 304 842 3.34     3   4
20    383 506 726       TRT 622 884 3.84     3   5
21    294 360 448       TRT  90 858 3.56     3   6
22    334 335 473       TRT 570 894 1.26     3   7


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Eric Fail

2011-Mar-08 03:45 UTC


Thanks to Gabor Grothendieck and Dennis Murphy I can now solve the first
part of my problem and already impress my colleagues with the
R program below (I know it could be written in a smarter way, but I am
learning). It reads my partly comma separated partly underscore
separated string and cleans it up in a very neat way.

Regardless of my inability to write tight code, I moved on to the
second part of my quest: putting it all into a loop over my
approximately 100 .txt files in /usr2/username/data/. I got
started with list.files() and my loop is more or less working, but I
got stuck on the last cbind part.

Is there a friendly R-hacker out there who would be willing to take a
look at my loop below?

Thanks,
Eric

#####################################################
##   The answer to the first part of my question   ##
#####################################################

Line <- readLines("/usr2/efail/data/example.txt")
s <- strsplit(Line, "ZZ_")[[1]]
s2 <- sub("BLOCK.*", "BLOCK", s)
s3 <- sub("@9z.svg", "", s2)
s4 <- gsub("_", ",", s3)
# the first piece is the header; its first field is the subject ID
s5 <- read.table(textConnection(s4[1]), sep = ",")
DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
DF$run <- ave(DF$block, DF$block, FUN = seq_along)
DF$V8 <- NULL
names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y",
               "BLOCK", "RUN")
DF$ID <- s5$V1
DF


#################################
##   The PARTLY WORKING loop   ##
#################################

fname <- list.files("/usr2/efail/data", pattern = "\\.txt$",
                    full.names = TRUE, recursive = TRUE, ignore.case = TRUE)

for (sp in seq_along(fname)) {
  Line <- readLines(fname[sp])
  s <- strsplit(Line, "ZZ_")[[1]]
  s2 <- sub("BLOCK.*", "BLOCK", s)
  s3 <- sub("@9z.svg", "", s2)
  s4 <- gsub("_", ",", s3)
  s5 <- read.table(textConnection(s4[1]), sep = ",")
  DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
  DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
  DF$run <- ave(DF$block, DF$block, FUN = seq_along)
  DF$V8 <- NULL
  names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y",
                 "BLOCK", "RUN")
  DF$ID <- s5$V1
  # FINAL.DF <- cbind(DF?   ## This is where I got stuck.
}



Gabor Grothendieck

2011-Mar-08 12:06 UTC

Create a function Read(filename) which returns the data frame for the
indicated file and then do this where fname is the vector of
filenames:

   do.call("rbind", lapply(fname, Read))
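
A minimal sketch of what such a function might look like, reusing the parsing steps from earlier in the thread. It assumes each file holds a single line in the format of example.txt with at least one BLOCK marker; the two small files written to a temporary directory are made up purely so the example runs on its own.

```r
# Read one file and return its trials as a data frame.
# Assumption: the file is a single line in the format discussed in this thread.
Read <- function(filename) {
  Line <- readLines(filename, warn = FALSE)[1]
  s  <- strsplit(Line, "ZZ_", fixed = TRUE)[[1]]
  s2 <- sub("BLOCK.*", "BLOCK", s)            # drop the junk after BLOCK
  s3 <- sub("@9z.svg", "", s2, fixed = TRUE)  # drop the image suffix
  s4 <- gsub("_", ",", s3)                    # underscores become commas
  DF <- read.table(text = s4[-1], sep = ",", as.is = TRUE)
  DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
  DF$run <- ave(DF$block, DF$block, FUN = seq_along)
  DF$V8 <- NULL
  names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y",
                 "BLOCK", "RUN")
  # the first field of the header piece is the subject ID
  DF$ID <- strsplit(s4[1], ",", fixed = TRUE)[[1]][1]
  DF
}

# Made-up demonstration files (not the real data).
demo_dir <- tempfile("data")
dir.create(demo_dir)
writeLines(paste0("S1,Exp,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,",
                  "ZZ_032_288_436_CON@9z.svg,332,878,3.66,",
                  "BLOCK@9z.svg,322,842,32.96,",
                  "ZZ_004_088_403_CON@9z.svg,606,908,3.32,"),
           file.path(demo_dir, "a.txt"))
writeLines(paste0("S2,Exp,ZZ_310_444_623_TRT@9z.svg,586,806,0.08,",
                  "BLOCK@9z.svg,542,840,4.44,",
                  "ZZ_345_230_662_TRT@9z.svg,632,844,2.47,"),
           file.path(demo_dir, "b.txt"))

fname <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
FINAL.DF <- do.call("rbind", lapply(fname, Read))
FINAL.DF
```

Stacking with rbind rather than cbind is the point here: every file contributes rows with the same columns, so the per-file data frames are appended one below the other.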
