On Feb 9, 2010, at 11:24 AM, Alex Levitchi wrote:
> Hello
> I have recently begun to work with R, so I am not very experienced.
> But anyway I cannot find a clear way to process my dataframe which
> is a bigger one.
> It looks similar to this
>
>> name=c("A","B","C","B","C","C","C","B","C")
>> nicknames=c("A1","B1","C1","B2","C2","C3","C4","B3","C5")
>> value=c(4,5,9,2,7,6,3,6,7)
>> table=data.frame(cbind(name,nickname,value))
>> table=data.frame(cbind(name,nicknames,value))
>> table
> name nicknames value
> 1 A A1 4
> 2 B B1 5
> 3 C C1 9
> 4 B B2 2
> 5 C C2 7
> 6 C C3 6
> 7 C C4 3
> 8 B B3 6
> 9 C C5 7
>
> So I have to rearrange it in the following way:
> - the first column should contain just unduplicated data, I did
> this, it is OK and it will look like
> 1 A
> 2 B
> 3 C
>
> - the second column should contain different 'nicknames' which
> correspond to the single A, B or C
> name nickname value
> 1 A A1
> 2 B B1,B2,B3
> 3 C C1,C2,C3,C4,C5
Data frames are not designed to hold items of irregular length. Lists
are the data structure best suited to this type of data. tapply() is
one function useful for collecting the elements of one vector
according to the contents of another ("name"):
(I renamed your table object "table1" to avoid confusion with the
table() function.)
> tapply(table1$nicknames, table1$name, list)
$A
[1] A1
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5
$B
[1] B1 B2 B3
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5
$C
[1] C1 C2 C3 C4 C5
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5
Building the data frame has created factor variables, which some
would see as a good thing but which was perhaps not desired. Since you
now have a list, you can use lapply() to apply the as.character
function to each element and recover only the character vectors:
>lapply( tapply(table1$nicknames, table1$name, list), as.character)
$A
[1] "A1"
$B
[1] "B1" "B2" "B3"
$C
[1] "C1" "C2" "C3" "C4" "C5"
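If the factors are unwanted from the start, one option is to skip
cbind() and pass the vectors straight to data.frame() with
stringsAsFactors = FALSE; split() then returns the same per-name list
as plain character vectors. A minimal sketch, assuming the three
vectors from your example (nick_by_name is only an illustrative name):

```r
# Example vectors from the question.
name      <- c("A", "B", "C", "B", "C", "C", "C", "B", "C")
nicknames <- c("A1", "B1", "C1", "B2", "C2", "C3", "C4", "B3", "C5")
value     <- c(4, 5, 9, 2, 7, 6, 3, 6, 7)

# No cbind(): each column keeps its own type, and strings stay character.
table1 <- data.frame(name, nicknames, value, stringsAsFactors = FALSE)

# One list element per distinct name, holding that name's nicknames.
nick_by_name <- split(table1$nicknames, table1$name)
nick_by_name$B
# [1] "B1" "B2" "B3"
```

Leaving out cbind() also keeps value numeric, because cbind() on mixed
character and numeric vectors produces a character matrix.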
Then I saw the rest of your request, so forget the above and see if
this two-liner looks a bit simpler.
> tcollapse <- tapply(table1$nicknames, table1$name, paste,
                      collapse=", ")
# gets you the strings separated by commas and spaces.
> cbind(names(tcollapse), tcollapse,
        lapply(tapply(table1$nicknames, table1$name, list), length))
tcollapse
A "A" "A1" 1
B "B" "B1, B2, B3" 3
C "C" "C1, C2, C3, C4, C5" 5
You can obviously name them whatever you like.
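For the mean column, the same tapply() pattern should work with mean
in place of paste. A rough sketch, assuming table1 is rebuilt without
cbind() so that value stays numeric; the object names nick, avg and
fin are only illustrative:

```r
# Example data from the question, built so that value stays numeric.
name      <- c("A", "B", "C", "B", "C", "C", "C", "B", "C")
nicknames <- c("A1", "B1", "C1", "B2", "C2", "C3", "C4", "B3", "C5")
value     <- c(4, 5, 9, 2, 7, 6, 3, 6, 7)
table1 <- data.frame(name, nicknames, value, stringsAsFactors = FALSE)

# One comma-separated string of nicknames per name.
nick <- tapply(table1$nicknames, table1$name, paste, collapse = ", ")
# One mean per name; both results are ordered by the sorted names.
avg <- tapply(table1$value, table1$name, mean)

fin <- data.frame(NAME = names(nick),
                  NICKNAME = as.vector(nick),
                  VALUE = as.vector(avg),
                  stringsAsFactors = FALSE)
fin
#   NAME           NICKNAME    VALUE
# 1    A                 A1 4.000000
# 2    B         B1, B2, B3 4.333333
# 3    C C1, C2, C3, C4, C5 6.400000
```

If value is stored as a factor, as.numeric() returns the internal
level codes rather than the original numbers, which would explain a
mean of 3 for A (whose only value is 4) in the loop output quoted
below; as.numeric(as.character(x)) is the usual way around that.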
--
David
>
> - the third one should contain the mean value of the numbers which
> correspond to the same A, B or C
> 1 A A1 mean(4)
> 2 B B1,B2,B3 mean(5,2,6)
> 3 C C1,C2,C3,C4,C5 mean(9,7,6,3,7)
>
> I did this using a loop 'for'.
> To be clear, I created three dataframes, one for each of the
> columns, and finally combine them
>
>> ulist=which(!duplicated(table$name)) # I extract the list of
>> positions in which I don't have duplications
>> name1=data.frame(table$name[ulist]) # I extract the list of unique
>> names
>> nicknames1=data.frame(row.names(1:length(ulist))) # I create a
>> dataframe of dimension equal to unique list length
>> value1=data.frame(row.names(1:length(ulist))) # I create a
>> dataframe of dimension equal to unique list length
>
>> for(i in 1:length(ulist)) {
> position=which(as.character(name1[i,1])==table$name)
> nicknames1[i,1]=toString(table$nicknames[position])
> value1[i,1]=mean(as.numeric(table$value[position]))
> }
>> fin=cbind(name1,nicknames1,value1)
>> colnames(fin)=c("NAME","NICKNAME","VALUE")
>> fin
> NAME NICKNAME VALUE
> 1 A A1 3.000000
> 2 B B1, B2, B3 3.333333
> 3 C C1, C2, C3, C4, C5 5.200000
>
> It works, but in general I work with dataframes of high dimensions
> (tens of thousands of rows or more).
> So my loop runs too slowly (e.g., a dataframe of 20000 rows and 3
> columns is processed in about 10 minutes).
> I intend to integrate it into a function, so it is obvious that time
> will be even longer.
>
> If someone can suggest how to modify what I have done, or a better
> way to do it, please send me a message.
>
> Kind regards to all guys who develop and maintain R sources for such
> dummies as me
> Alex Levitchi
David Winsemius, MD
Heritage Laboratories
West Hartford, CT