thr3ads.net - R help - [R] Complicated analysis for huge databases [Nov 2017]


Allaisone 1

2017-Nov-18 09:52 UTC

[R] Complicated analysis for huge databases

Although the loop seems to be formulated correctly, I wonder why
it gives me these errors:

- object 'i' not found
- unexpected '}' in "}"

The desired output is expected to be very large: for each dataframe in the
list of dataframes I expect to see a maf value for each of the 600 columns, and
that is only for one dataframe in the list. I have around 150-200 dataframes, and
I am not sure how R will store these results, but first I need the analysis to be
done correctly. The final output has to be something like this:

> mafsforeachcolumns(I,II,...600)foreachcombination
     MealsCombinations   Cust.ID   I       II     III    IV      ...... 600
1    33-55               1         0.124   0.10   0.65   0.467
                         3
                         5

2    44-66               7         0.134   0.43   0.64   0.479
                         4
                         9
.
.
~180 dataframes


________________________________
From: Boris Steipe <boris.steipe at utoronto.ca>
Sent: 18 November 2017 00:35:16
To: Allaisone 1; R-help
Subject: Re: [R] Complicated analysis for huge databases

Something like the following?

AllMAFs <- list()

for (i in length(SeparatedGroupsofmealsCombs) {
  AllMAFs[[i]] <- apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x) maf(tabulate(x+1)))
}


(untested, of course)
Also the solution is a bit generic since I don't know what the output of
maf() looks like in your case, and I don't understand why you use tabulate
because I would have assumed that's what maf() does - but that's not for
me to worry about :-)



B.
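
A note on the loop above, offered as a sketch rather than a tested fix for the poster's data: as written, the for (...) header is missing its closing parenthesis, and "i in length(x)" visits only the last index instead of every element; seq_along() is the usual idiom. The toy list `groups` and its column names are invented for illustration. Because maf() is never defined in this thread, the example stops at the per-column counts that would be passed to it.

```r
# Toy stand-in for the list returned by split(); names and values are made up.
groups <- list(
  "33_55" = data.frame(I = c(0, 1, 2), II = c(1, 0, 1)),
  "44_66" = data.frame(I = c(0, 1, 2), II = c(2, 1, 0))
)

AllCounts <- vector("list", length(groups))
names(AllCounts) <- names(groups)

# seq_along() yields 1, 2, ..., length(groups); note the closing parenthesis.
for (i in seq_along(groups)) {
  # nbins = 3 keeps one slot each for the values 0, 1 and 2,
  # even when the largest value does not occur in a column.
  AllCounts[[i]] <- apply(groups[[i]], 2,
                          function(x) tabulate(x + 1, nbins = 3))
}

# Each element is a 3 x (number of columns) matrix of counts.
AllCounts[["33_55"]]
```

With the header unbalanced, R stops parsing at the "{", then evaluates the body line on its own (where `i` does not exist) and finally meets a stray "}". That would be consistent with the two errors reported at the top of this message.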


> On Nov 17, 2017, at 7:15 PM, Allaisone 1 <allaisone1 at hotmail.com> wrote:
>
>
> Thanks Boris, this was very helpful but I'm struggling with the last part.
>
> 1) I combined the first 2 columns :-
>
>
> library(tidyr)
> SingleMealsCode <- unite(MyData, MealsCombinations, c(MealA, MealB), remove = FALSE)
> SingleMealsCode <- SingleMealsCode[, -2]
>
> 2) I separated this dataframe into different dataframes based on the
> "MealsCombinations" column so R will recognize each meal combination separately:
>
> SeparatedGroupsofmealsCombs <- split(SingleMealsCode, SingleMealsCode$MealsCombinations)
>
> After investigating the structure of "SeparatedGroupsofmealsCombs", I can see
> a list of different dataframes, each of which represents a different meal
> combination, which is great.
>
> Now I'm struggling with the last part: how can I run the maf code for all dataframes?
>
> When I run this code as before:
>
> maf <- apply(SeparatedGroupsofmealsCombs, 2, function(x) maf(tabulate(x+1)))
>
> an error message says: dim(X) must have a positive length. I'm not sure which length
> I need to specify. Any suggestions to correct this syntax?
>
> Regards
> Allaisone
> From: Boris Steipe <boris.steipe at utoronto.ca>
> Sent: 17 November 2017 21:12:06
> To: Allaisone 1
> Cc: R-help
> Subject: Re: [R] Complicated analysis for huge databases
>
> Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66"
> and use split() on these IDs to break up your dataset. Iterate over the list
> of data frames split() returns.
>
>
> B.
>
> > On Nov 17, 2017, at 12:59 PM, Allaisone 1 <allaisone1 at hotmail.com> wrote:
> >
> >
> > Hi all ..,
> >
> >
> > I have a large dataset of around 600,000 rows and 600 columns. The first
> > column holds codes for Meal A and the second column holds codes for Meal B.
> > The third column holds customer IDs, where each customer had a combination of
> > meals. Each of the remaining columns contains the values 0, 1, or 2. The
> > dataset is organised so that the first group of customers had similar meal
> > combinations, followed by another group of customers with similar meal
> > combinations that differ from the first group, and so on. The dataset looks
> > like this:
> >
> >
> >> MyData
> >
> >      Meal A   Meal B   Cust.ID   I    II   III   IV   ...... 600
> > 1    33       55       1         0    1    2     0
> > 2    33       55       3         1    0    2     2
> > 3    33       55       5         2    1    1     2
> > 4    44       66       7         0    2    2     2
> > 5    44       66       4         1    1    0     1
> > 6    44       66       9         2    0    1     2
> > .
> > .
> > 600,000
> >
> >
> >
> > I want to find maf() for each column (from 4 to 600) after calculating the
> > frequency of the 3 values (0, 1, 2), but this should be done group by group
> > (i.e. group 33-55: rows 1:3, then group 44-66: rows 4:6, and so on).
> >
> > I can do the analysis for an entire column, but not group by group, like this:
> >
> > MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1)))
> >
> > How can I modify this code to tell R to do the analysis group by group for
> > each column, so that I get the maf value for group 33-55 in column I, then the
> > maf value for group 44-66 in the same column I, then the rest of the groups in
> > this column, and then the same for the remaining columns?
> >
> > In fact, I'm interested in doing this analysis for only 300 columns, not all
> > of the 600 columns. I have another sheet that contains the names of the
> > columns of interest, like this:
> >
> >> ColOfinterest
> >
> > Col
> > I
> > IV
> > V
> > .
> > .
> > 300
> >
> > Could anyone help with the best combination of syntax to perform this
> > complex analysis?
> >
> > Regards
> > Allaisone
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
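
The original question at the bottom of this thread (one statistic per column, computed group by group, restricted to a list of columns of interest) can be sketched in base R without splitting the data frame first. This is only an illustration under stated assumptions: the toy data are modelled on the example rows, the column names MealA and MealB are taken from the unite() call, and maf() is a hypothetical stand-in, since its real definition is never shown in the thread.

```r
# Hypothetical stand-in for maf(): assumed to take the counts of the values
# 0, 1, 2 and return the frequency of the rarer of two alleles.
# This is an assumption, not the poster's function.
maf <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# Toy data modelled on the example rows in the question.
MyData <- data.frame(
  MealA   = c(33, 33, 33, 44, 44, 44),
  MealB   = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I   = c(0, 1, 2, 0, 1, 2),
  II  = c(1, 0, 1, 2, 1, 0),
  III = c(2, 2, 1, 2, 0, 1),
  IV  = c(0, 2, 2, 2, 1, 2)
)
ColOfinterest <- data.frame(Col = c("I", "IV"), stringsAsFactors = FALSE)

# One grouping factor per row, e.g. "33-55"; drop = TRUE removes
# meal combinations that never occur.
group <- interaction(MyData$MealA, MyData$MealB, sep = "-", drop = TRUE)
cols  <- ColOfinterest$Col

# For each column of interest, apply the statistic within each group.
# The result has one row per meal combination and one column per column of interest.
MAFs <- sapply(MyData[cols], function(column) {
  tapply(column, group, function(x) maf(tabulate(x + 1, nbins = 3)))
})

MAFs
```

Selecting the columns by name through ColOfinterest$Col avoids hard-coding positions such as 4:600, and nbins = 3 guards against groups in which the value 2 never appears.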


David Winsemius

2017-Nov-18 20:06 UTC


[R] Complicated analysis for huge databases

> On Nov 18, 2017, at 1:52 AM, Allaisone 1 <allaisone1 at hotmail.com> wrote:
>
> Although the loop seems to be formulated correctly I wonder why
> it gives me these errors :
>
> -object 'i' not found
> - unexpected '}' in "}"

You probably did not copy the entire code offered. But we cannot know, since you
did not "show your code", nor did you post complete error messages.
Both of these practices are strongly recommended by the Posting Guide. Please
read it (again?).

--
David.
David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' 
-Gehm's Corollary to Clarke's Third Law

Allaisone 1

2017-Nov-18 21:40 UTC


[R] Complicated analysis for huge databases

The loop:

AllMAFs <- list()

for (i in length(SeparatedGroupsofmealsCombs) {
  AllMAFs[[i]] <- apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x) maf(tabulate(x+1)))
}

gives these errors (I tried this many times and I'm sure I copied it entirely):

Error in apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x) maf(tabulate(x + :
  object 'i' not found
> }
Error: unexpected '}' in " }"

The lapply function:

results <- lapply(SeparatedGroupsofmealsCombs, function(x) maf(tabulate(x+1)))

gives this error:

Error in FUN(left, right) : non-numeric argument to binary operator

I have been trying since yesterday, but until now I have not been able to identify
the correct syntax.
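
The lapply() error is consistent with `x + 1` being applied to a whole data frame that still contains the non-numeric MealsCombinations column; apply() would likewise coerce such a data frame to a character matrix. A minimal sketch, assuming the value columns are everything except MealsCombinations and Cust.ID, and using a hypothetical stand-in for maf() (its real definition is not shown in this thread), drops the non-value columns first and then computes one number per column within each group. The toy data are invented.

```r
# Hypothetical stand-in for maf(); the real function is not shown in the thread.
# Assumed input: counts of the values 0, 1 and 2.
maf <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# Toy version of the data (values invented); it still carries the character
# column that would trigger "non-numeric argument to binary operator".
SingleMealsCode <- data.frame(
  MealsCombinations = rep(c("33_55", "44_66"), each = 3),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I  = c(0, 1, 2, 0, 1, 2),
  II = c(1, 0, 1, 2, 1, 0),
  stringsAsFactors = FALSE
)
SeparatedGroupsofmealsCombs <- split(SingleMealsCode,
                                     SingleMealsCode$MealsCombinations)

# Keep only the columns that hold the 0/1/2 values.
value_cols <- setdiff(names(SingleMealsCode), c("MealsCombinations", "Cust.ID"))

# Outer lapply(): one data frame per meal combination.
# Inner vapply(): one number per value column, checked to be a single numeric.
results <- lapply(SeparatedGroupsofmealsCombs, function(df) {
  vapply(df[value_cols],
         function(x) maf(tabulate(x + 1, nbins = 3)),
         numeric(1))
})

# Stack the per-group vectors: rows are meal combinations, columns are variables.
MAFmatrix <- do.call(rbind, results)
MAFmatrix
```

Storing the result as a groups-by-columns numeric matrix is one plausible answer to the storage worry: roughly 180 rows by 600 columns is small by R standards.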




