João Azevedo Patrício
2014-Jul-04 12:50 UTC
[R] Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions
Hi,

I've been trying to solve this issue but with no success.

I have some data like this:

1 > TC    WC
2 > 0     Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy
3 > 0     Nanoscience & Nanotechnology; Materials Science, Multidisciplinary; Physics, Applied
4 > 2     Physics, Nuclear; Physics, Particles & Fields
5 > 0     Chemistry, Inorganic & Nuclear
6 > 2     Chemistry, Physical; Materials Science, Multidisciplinary; Metallurgy & Metallurgical Engineering

And I need to have this:

1 > TC    WC
2 > 0     Instruments & Instrumentation
2 > 0     Nuclear Science & Technology
2 > 0     Physics, Particles & Fields
2 > 0     Spectroscopy
3 > 0     Nanoscience & Nanotechnology
3 > 0     Materials Science, Multidisciplinary
3 > 0     Physics, Applied
4 > 2     Physics, Nuclear
4 > 2     Physics, Particles & Fields
5 > 0     Chemistry, Inorganic & Nuclear
6 > 2     Chemistry, Physical
6 > 2     Materials Science, Multidisciplinary
6 > 2     Metallurgy & Metallurgical Engineering

This means repeating the row for each element in WC while keeping the same value in TC. The goal is to check how many TC (sum) there are by WC, when WC is multiple.

I've tried to separate the column using strsplit, but then I cannot keep track of TC.

Thanks in advance.

--
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee
"Take 2 seconds to think before you act"
arun
2014-Jul-04 14:15 UTC
[R] Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions
Hi,

Try:

dat1 <- read.table(text="'1 > TC' 'WC'
'2 > 0' 'Instruments & Instrumentation; Nuclear Science & Technology;Physics, Particles & Fields; Spectroscopy'
'3 > 0' 'Nanoscience & Nanotechnology; Materials Science,Multidisciplinary; Physics, Applied'
'4 > 2' 'Physics, Nuclear; Physics, Particles & Fields'
'5 > 0' 'Chemistry, Inorganic & Nuclear'
'6 > 2' 'Chemistry, Physical; Materials Science, Multidisciplinary;Metallurgy & Metallurgical Engineering'",sep="",header=FALSE, stringsAsFactors=FALSE)

library(data.table)

Using `cSplit()` from https://gist.github.com/mrdwab/11380733

cSplit(dat1, "V2", ";", "long")
        V1                                     V2
 1: 1 > TC                                     WC
 2:  2 > 0          Instruments & Instrumentation
 3:  2 > 0           Nuclear Science & Technology
 4:  2 > 0            Physics, Particles & Fields
 5:  2 > 0                           Spectroscopy
 6:  3 > 0           Nanoscience & Nanotechnology
 7:  3 > 0    Materials Science,Multidisciplinary
 8:  3 > 0                       Physics, Applied
 9:  4 > 2                       Physics, Nuclear
10:  4 > 2            Physics, Particles & Fields
11:  5 > 0         Chemistry, Inorganic & Nuclear
12:  6 > 2                    Chemistry, Physical
13:  6 > 2   Materials Science, Multidisciplinary
14:  6 > 2 Metallurgy & Metallurgical Engineering

A.K.
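A possible follow-up, sketched in base R: the long table above still carries the header row as data and keeps the "N > " prefix in V1. Assuming that prefix is only a row marker from the original listing (an assumption; the thread does not say so), TC could be recovered roughly like this. The names long_raw, body, and clean are hypothetical, and only a few rows of the output are retyped here for illustration.

```r
# A few rows retyped from the long-format output (V1/V2 as read by read.table).
long_raw <- data.frame(
  V1 = c("1 > TC", "2 > 0", "2 > 0", "3 > 0", "4 > 2"),
  V2 = c("WC", "Instruments & Instrumentation", "Nuclear Science & Technology",
         "Nanoscience & Nanotechnology", "Physics, Nuclear"),
  stringsAsFactors = FALSE
)

# Drop the first row, which is really the header ("1 > TC", "WC").
body <- long_raw[-1, ]

# Assumption: "N > " is a row marker, so the number after it is TC.
clean <- data.frame(
  row = as.integer(sub(" > .*$", "", body$V1)),      # marker before " > "
  TC  = as.integer(sub("^[0-9]+ > ", "", body$V1)),  # value after " > "
  WC  = trimws(body$V2),                             # strip stray blanks
  stringsAsFactors = FALSE
)

print(clean)
```

With TC stored as an integer column, sums per WC can be computed with the usual grouping tools.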
John McKown
2014-Jul-05 02:35 UTC
[R] Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions
Here is the best that I've come up with, which seems to give the desired result for the example data given.

splitAtSemiColon <- function(input) {
    # split each WC string at ';' (gives one character vector per row)
    z <- strsplit(input$WC, ';');
    # repeat each TC once per piece so it stays paired with its WC values
    result <- data.frame(TC=rep(input$TC, sapply(z, length)), WC=unlist(z));
    return(result);
}
flatted.data <- splitAtSemiColon(original.data);

<transcript>
> print(original.data,right=FALSE)
  TC
1 0
2 0
3 2
4 0
5 2
  WC
1 Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy
2 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary; Physics, Applied
3 Physics, Nuclear; Physics, Particles & Fields
4 Chemistry, Inorganic & Nuclear
5 Chemistry, Physical; Materials Science, Multidisciplinary; Metallurgy & Metallurgical Engineering
> print(splitAtSemiColon,right=FALSE);
function(x) {
z=strsplit(x$WC,';');
result3=data.frame(TC=rep(x$TC,sapply(z,length)),WC=unlist(z));
return(result3);
}
> print(splitAtSemiColon(original.data),right=FALSE);
   TC WC
1  0  Instruments & Instrumentation
2  0  Nuclear Science & Technology
3  0  Physics, Particles & Fields
4  0  Spectroscopy
5  0  Nanoscience & Nanotechnology
6  0  Materials Science, Multidisciplinary
7  0  Physics, Applied
8  2  Physics, Nuclear
9  2  Physics, Particles & Fields
10 0  Chemistry, Inorganic & Nuclear
11 2  Chemistry, Physical
12 2  Materials Science, Multidisciplinary
13 2  Metallurgy & Metallurgical Engineering
</transcript>

Note that I still have a problem in that the WC data can have leading and/or trailing blanks due to the way that strsplit works. The easiest way to fix this is to use the str_trim() function from the stringr package.

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown
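One way to handle the stray blanks and also reach the stated goal (the sum of TC per WC), as a rough base R sketch. It assumes original.data stores WC as character rather than factor, and it uses trimws() and lengths(), which need R 3.2.0 or later. The names split_and_trim, long, and totals are made up for this illustration.

```r
# Example data retyped from the thread; WC kept as character.
original.data <- data.frame(
  TC = c(0L, 0L, 2L, 0L, 2L),
  WC = c(
    "Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy",
    "Nanoscience & Nanotechnology; Materials Science, Multidisciplinary; Physics, Applied",
    "Physics, Nuclear; Physics, Particles & Fields",
    "Chemistry, Inorganic & Nuclear",
    "Chemistry, Physical; Materials Science, Multidisciplinary; Metallurgy & Metallurgical Engineering"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split WC at ';', repeat TC to match, trim blanks.
split_and_trim <- function(input) {
  pieces <- strsplit(input$WC, ";", fixed = TRUE)
  data.frame(
    TC = rep(input$TC, lengths(pieces)),  # one TC per split piece
    WC = trimws(unlist(pieces)),          # remove leading/trailing blanks
    stringsAsFactors = FALSE
  )
}

long <- split_and_trim(original.data)

# Sum TC within each distinct WC value.
totals <- aggregate(TC ~ WC, data = long, FUN = sum)
print(totals)
```

Trimming before aggregating matters here: without it, " Physics, Particles & Fields" and "Physics, Particles & Fields" would be counted as two different categories.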