Eric Fail
2011-Mar-07 03:13 UTC
[R] Parsing question, partly comma separated partly underscore separated string
Dear R-list,

I have a string that is partly comma separated and partly underscore separated, and I am trying to parse it into R.

Furthermore, I have a bunch of them, and they are quite long. I have now spent most of my Sunday trying to figure this out and thought I would try the list to see if someone here would be able to get me started.

My data structure looks like this:

(in an example.txt file)
Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960 pixels, On Device M, M, 3.2.4,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,
(end of example.txt file)

The above is approximately 5% of the length of a full file, and I have about 100 of them. Please note that the strings end with a comma.
I am trying to parse it into something like this:

ID         ImgNam BLOCK RUN  Tx  Ty Treatment   x   y    Y
Subject ID    373     1   1 462 488       TRT 592 820 3.35
Subject ID     32     1   2 288 436       CON 332 878 3.66
Subject ID    384     1   3 204 433       TRT 334 824 3.28
Subject ID    365     1   4 575 683       TRT 598 878  3.5
Subject ID      5     1   5 480 239       CON 630 856 8.03
Subject ID     30     1   6 423 394       CON  98 846 4.09
Subject ID     33     1   7 596 398       CON 636 902 3.28
Subject ID    263     1   8  64 320       TRT 570 894 1.26
Subject ID      4     2   1  88 403       CON 606 908 3.32
Subject ID    703     2   2 546 434       CON 624 934 2.58
Subject ID    712     2   3 348 543       CON  20 828 5.36
Subject ID      5     2   4  48 239       CON 580 830 4.36
Subject ID    310     2   5 444 623       TRT 586 806 0.08
Subject ID     30     2   6 423 394       CON 350 854 3.84
Subject ID    340     2   7 382 539       TRT 570 894 1.26
Subject ID    345     3   1 230 662       TRT 632 844 2.47
Subject ID      6     3   2 335 309       CON  96 930 3.63
Subject ID    782     3   3 346 746       TRT 306 850 2.58
Subject ID    334     3   4 200 333       TRT 304 842 3.34
Subject ID    383     3   5 506 726       TRT 622 884 3.84
Subject ID    294     3   6 360 448       TRT  90 858 3.56
Subject ID    334     3   7 335 473       TRT 570 894 1.26

I could do it in Excel, but it would take me a week, and it would be stupid. If someone could please help me get started, I would very much appreciate it. It would not only benefit me; my colleagues would also see the benefit of R and the R-list in particular.

Thanks in advance!

Eric
Don McKenzie
2011-Mar-07 04:39 UTC
[R] Parsing question, partly comma separated partly underscore separated string
In a good text editor it would be one command per file. So if you are on UNIX or Mac OS X, you could loop through the files with (probably) an awk command. I don't remember the syntax (it's been too long), but it should be just a few lines of shell script. On Windows I'm not sure, but there should be something similar. Maybe that "gets you started".

Probably one of the list jocks will have it nailed if you wait.

Don McKenzie, Research Ecologist
Pacific Wildland Fire Sciences Lab, US Forest Service
Affiliate Professor, School of Forest Resources, College of the Environment
CSES Climate Impacts Group, University of Washington
Dennis Murphy
2011-Mar-07 05:55 UTC
[R] Parsing question, partly comma separated partly underscore separated string
Hi:

This should get you most of the way there; I'll let you figure out how to assign the BLOCK and RUN numbers.

    tx <- "Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960 pixels, On Device M, M, 3.2.4,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,"

    # Begin by splitting up the text string tx by 'ZZ_'; this produces a list.
    # Then use lapply to split again by '@9z.svg' and remove the first element
    # (basically the header part of tx above).
    lst <- strsplit(tx, 'ZZ_')
    lst2 <- lapply(lst[[1]], strsplit, '@9z.svg')
    lst2[[1]] <- NULL

    # This function breaks up the two strings, one separated by '_',
    # the other by ','. If a third string exists in a list component,
    # it is ignored.
    stringBreak <- function(svec) {
        svec <- unlist(svec)
        u <- svec[1]
        v <- svec[2]
        us <- unlist(strsplit(u, '_'))
        # since this string starts with a comma, remove the first empty string
        vs <- unlist(strsplit(v, ','))[-1]
        # check for presence of 'BLOCK' string
        if (length(vs) == 4) endblock <- 1 else endblock <- 0
        # write elements to a one-line data frame;
        # the image number is the first underscore-separated field
        data.frame(IngNam = as.numeric(us[1]),
                   Tx = as.numeric(us[2]),
                   Ty = as.numeric(us[3]),
                   Treatment = us[4],
                   x = as.numeric(vs[1]),
                   y = as.numeric(vs[2]),
                   Y = as.numeric(vs[3]),
                   endblock = endblock)
    }

    # Slurp into a data frame:
    # Method 1: package plyr
    library(plyr)
    df0 <- ldply(lst2, stringBreak)

    # Method 2: do.call()
    df0 <- do.call(rbind, lapply(lst2, stringBreak))

Result:

    > ldply(lst2, stringBreak)
       IngNam  Tx  Ty Treatment   x   y    Y endblock
    1     373 462 488       TRT 592 820 3.35        0
    2      32 288 436       CON 332 878 3.66        0
    3     384 204 433       TRT 334 824 3.28        0
    4     365 575 683       TRT 598 878 3.50        0
    5       5 480 239       CON 630 856 8.03        0
    6      30 423 394       CON  98 846 4.09        0
    7      33 596 398       CON 636 902 3.28        0
    8     263  64 320       TRT 570 894 1.26        1
    9       4  88 403       CON 606 908 3.32        0
    10    703 546 434       CON 624 934 2.58        0
    11    712 348 543       CON  20 828 5.36        0
    12      5  48 239       CON 580 830 4.36        0
    13    310 444 623       TRT 586 806 0.08        0
    14     30 423 394       CON 350 854 3.84        0
    15    340 382 539       TRT 570 894 1.26        1
    16    345 230 662       TRT 632 844 2.47        0
    17      6 335 309       CON  96 930 3.63        0
    18    782 346 746       TRT 306 850 2.58        0
    19    334 200 333       TRT 304 842 3.34        0
    20    383 506 726       TRT 622 884 3.84        0
    21    294 360 448       TRT  90 858 3.56        0
    22    334 335 473       TRT 570 894 1.26        1

HTH,
Dennis
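The BLOCK and RUN numbering that the reply above leaves open could be derived from the endblock flag along the following lines. This is only a sketch: the small df0 here is a made-up stand-in for the parsed data frame, and it assumes that endblock is 1 on the last row of each block and 0 everywhere else.

```r
# Made-up stand-in for the parsed data frame (illustration only):
# endblock is 1 on the last row of each block, 0 otherwise.
df0 <- data.frame(endblock = c(0, 0, 1, 0, 1, 0, 0))

# A row belongs to a new block when the row before it ended a block,
# so shift the flag down by one row and take the running total.
df0$BLOCK <- cumsum(c(0, head(df0$endblock, -1))) + 1

# RUN counts the rows within each block.
df0$RUN <- ave(df0$BLOCK, df0$BLOCK, FUN = seq_along)

print(df0)
```

Shifting the flag before taking the cumulative sum keeps the flagged row itself in the block it closes, which matches the layout requested in the original question.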
Gabor Grothendieck
2011-Mar-07 13:18 UTC
[R] Parsing question, partly comma separated partly underscore separated string
Try this. We split the line by ZZ_ giving s and remove the junk after the word BLOCK giving s2. Then we remove @9z.svg giving s3 and convert each _ to , giving s4. We then read it into a data frame using comma as the separator, calculate the block and run columns, remove one junk column and assign column names.

    > Line <- "Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960 pixels, On Device M, M, 3.2.4,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,"
    >
    > s <- strsplit(Line, "ZZ_")[[1]]
    > s2 <- sub("BLOCK.*", "BLOCK", s)
    > s3 <- sub("@9z.svg", "", s2)
    > s4 <- gsub("_", ",", s3)
    > DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
    > DF$block <- head(cumsum(c("", DF$V8) == "BLOCK")+1, -1)
    > DF$run <- ave(DF$block, DF$block, FUN = seq_along)
    > DF$V8 <- NULL
    > names(DF) <- c("IngNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
    > DF
       IngNam  Tx  Ty Treatment   x   y    Y BLOCK RUN
    1     373 462 488       TRT 592 820 3.35     1   1
    2      32 288 436       CON 332 878 3.66     1   2
    3     384 204 433       TRT 334 824 3.28     1   3
    4     365 575 683       TRT 598 878 3.50     1   4
    5       5 480 239       CON 630 856 8.03     1   5
    6      30 423 394       CON  98 846 4.09     1   6
    7      33 596 398       CON 636 902 3.28     1   7
    8     263  64 320       TRT 570 894 1.26     1   8
    9       4  88 403       CON 606 908 3.32     2   1
    10    703 546 434       CON 624 934 2.58     2   2
    11    712 348 543       CON  20 828 5.36     2   3
    12      5  48 239       CON 580 830 4.36     2   4
    13    310 444 623       TRT 586 806 0.08     2   5
    14     30 423 394       CON 350 854 3.84     2   6
    15    340 382 539       TRT 570 894 1.26     2   7
    16    345 230 662       TRT 632 844 2.47     3   1
    17      6 335 309       CON  96 930 3.63     3   2
    18    782 346 746       TRT 306 850 2.58     3   3
    19    334 200 333       TRT 304 842 3.34     3   4
    20    383 506 726       TRT 622 884 3.84     3   5
    21    294 360 448       TRT  90 858 3.56     3   6
    22    334 335 473       TRT 570 894 1.26     3   7

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
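To see what each intermediate step produces, the same pipeline can be run on a much shorter line. The sketch below rests on a few assumptions: the line is made up for illustration (three images and one BLOCK marker), the header piece is dropped with s4[-1] rather than skip = 1, and read.table(text = ...) stands in for textConnection(), which needs a reasonably recent version of R.

```r
# Made-up, shortened line in the same layout as example.txt (illustration only).
Line <- paste0(
  "Subject ID,ExperimentName,",
  "ZZ_373_462_488_TRT@9z.svg,592,820,3.35,",
  "ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,",
  "ZZ_004_088_403_CON@9z.svg,606,908,3.32,"
)

s  <- strsplit(Line, "ZZ_", fixed = TRUE)[[1]]  # header piece + one piece per image
s2 <- sub("BLOCK.*", "BLOCK", s)                # drop everything after a BLOCK marker
s3 <- sub("@9z.svg", "", s2, fixed = TRUE)      # drop the file suffix
s4 <- gsub("_", ",", s3, fixed = TRUE)          # underscores become commas
print(s4[-1])                                   # one comma separated record per image

# Every record now has eight fields; the eighth is either empty or "BLOCK".
DF <- read.table(text = s4[-1], sep = ",", as.is = TRUE)

# A row starts a new block when the row before it carried the BLOCK marker.
is_block_end <- DF$V8 == "BLOCK"
DF$block <- cumsum(c(FALSE, head(is_block_end, -1))) + 1
DF$run <- ave(DF$block, DF$block, FUN = seq_along)
print(DF[, c("V1", "V8", "block", "run")])
```

The block calculation here is written with an explicit shifted logical vector; it is meant to give the same numbering as the head(cumsum(...) + 1, -1) expression in the reply.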
Eric Fail
2011-Mar-08 03:45 UTC
[R] Parsing question, partly comma separated partly underscore separated string
Thanks to Gabor Grothendieck and Dennis Murphy, I can now solve the first part of my problem and already impress my colleagues with the R program below (I know it could be written in a smarter way, but I am learning). It reads my partly comma separated, partly underscore separated string and cleans it up in a very neat way.

Regardless of my inability to write tight code, I moved on to the second part of my quest: putting it all into a loop so that I can process my approximately 100 .txt files in /usr2/username/data/. I got started with list.files() and my loop is more or less working, but I got stuck on the last cbind part.

Is there a friendly R-hacker out there who would be willing to take a look at my loop below?

Thanks,
Eric

    ###################################################
    ##                                               ##
    ##  The answer to the first part of my question  ##
    ##                                               ##
    ###################################################

    Line <- readLines(file("/usr2/efail/data/example.txt"))
    s <- strsplit(Line, "ZZ_")[[1]]
    s2 <- sub("BLOCK.*", "BLOCK", s)
    s3 <- sub("@9z.svg", "", s2)
    s4 <- gsub("_", ",", s3)
    s5 <- read.table(textConnection(s4[1]), sep = ",")
    DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
    DF$block <- head(cumsum(c("", DF$V8) == "BLOCK")+1, -1)
    DF$run <- ave(DF$block, DF$block, FUN = seq_along)
    DF$V8 <- NULL
    names(DF) <- c("IngNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
    DF$ID <- s5$V1
    DF

    ###############################
    ##                           ##
    ##  The PARTLY WORKING loop  ##
    ##                           ##
    ###############################

    fname <- list.files("/usr2/efail/data", pattern = ".txt", full.names = TRUE,
                        recursive = TRUE, ignore.case = TRUE)

    for (sp in 1:length(fname)) {
        Line <- readLines(file(fname[sp]))
        s <- strsplit(Line, "ZZ_")[[1]]
        s2 <- sub("BLOCK.*", "BLOCK", s)
        s3 <- sub("@9z.svg", "", s2)
        s4 <- gsub("_", ",", s3)
        s5 <- read.table(textConnection(s4[1]), sep = ",")
        DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
        DF$block <- head(cumsum(c("", DF$V8) == "BLOCK")+1, -1)
        DF$run <- ave(DF$block, DF$block, FUN = seq_along)
        DF$V8 <- NULL
        names(DF) <- c("IngNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
        DF$ID <- s5$V1
        FINAL.DF <- cbind(DF?   ## This is where I got stuck.
    }
Gabor Grothendieck
2011-Mar-08 12:06 UTC
[R] Parsing question, partly comma separated partly underscore separated string
Create a function Read(filename) which returns the data frame for the indicated file and then do this, where fname is the vector of filenames:

    do.call("rbind", lapply(fname, Read))
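One possible shape for such a Read() function is sketched below. It simply wraps the parsing steps from earlier in the thread; the two temporary files and their contents are made up so that the example runs on its own, and taking the ID from the first comma separated field of the header piece is an assumption about the file layout. It also assumes every file contains at least one BLOCK marker, so that the eighth column is read as character.

```r
# Sketch of a Read() function: parse one file into a data frame.
Read <- function(filename) {
  Line <- readLines(filename, warn = FALSE)[1]
  s  <- strsplit(Line, "ZZ_", fixed = TRUE)[[1]]
  s2 <- sub("BLOCK.*", "BLOCK", s)
  s3 <- sub("@9z.svg", "", s2, fixed = TRUE)
  s4 <- gsub("_", ",", s3, fixed = TRUE)
  DF <- read.table(text = s4[-1], sep = ",", as.is = TRUE)
  DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
  DF$run <- ave(DF$block, DF$block, FUN = seq_along)
  DF$V8 <- NULL
  names(DF) <- c("IngNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
  # Assumption: the subject ID is the first field of the header piece.
  DF$ID <- strsplit(s4[1], ",", fixed = TRUE)[[1]][1]
  DF
}

# Two made-up input files (illustration only).
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(paste0(
  "Subject A,ExperimentName,",
  "ZZ_373_462_488_TRT@9z.svg,592,820,3.35,",
  "ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,",
  "ZZ_004_088_403_CON@9z.svg,606,908,3.32,"), f1)
writeLines(paste0(
  "Subject B,ExperimentName,",
  "ZZ_032_288_436_CON@9z.svg,332,878,3.66,",
  "ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,"), f2)

# In real use, fname would come from list.files(); here it is the two temp files.
fname <- c(f1, f2)
FINAL.DF <- do.call("rbind", lapply(fname, Read))
print(FINAL.DF)
```

Because each call returns a data frame with the same columns, rbind stacks them by row, which is why the cbind attempt in the loop is not needed.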