thr3ads.net - R help - [R] Frequency of Combinations [Apr 2017]

If this information is useful, please help other people find it:
Share via:

abo dalash

2017-Apr-24 05:07 UTC

[R] Frequency of Combinations

Hi there

I have data set with 500,000 patients (rows) and the first column is Patient
I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug 20 so each row
starts with the patient ID and the remaining of cells in the row are codes for
names of the treatments taken by the patient. Number of treatments differ
between patients. For example, there are patients with 3 treatments only and
patients with 20 drugs. The unique number of treatments in the entire data set
is about 6700 drugs. However, I'm interested in studying only 128 drugs,
these drugs are listed in a second sheet as code numbers associated with their
meanings (names of drugs representing the code). I'm interested in
identifying the most frequently used DRUG COMBINATIONS between only the 128
drugs among the 6700 drugs. The structure of the Excell file to be used in
analysis is like this:

-Sheet 1( the entire data set):-
1   Patient ID    Drug1    Drug2 ....    Drug 20.
2  1125            45            46                55
3  1126              60           55                 45
.
.
500,000

-Sheet 2 (list of codes meanings for only the drugs of interest):

1    Drug code             meaning
2    45                            Simvastatin
3    55                            Aspirin
4    60                            Paracetamol
.
128

The desired output I'm looking for :

 Drug codes      Meaning          Frequency
45+55               Simvastatin              2
                         +Aspirin
60+55             Aspirin+                       1
                        Paracetamol
60+45             Simvastatin+                1
                        Paracetamol

Please note the the final output does not include any combination containing
drug 46 as this is not in the list of drugs preferred to be studied which are
mentioned in sheet 2.

Could you please help me which R codes and packages should be used to run this
analyisis?

Regards
Mustafa

	[[alternative HTML version deleted]]

Thierry Onkelinx

2017-Apr-24 14:43 UTC

head link

[R] Frequency of Combinations

Dear Mustafa,

I'd recommend the packages readxls to import the data, tidyr to transform
the data into long format and dplyr to select the data.

1. read the data into R with read_excel()
2. transform sheet 1 into a long format with gather(). The result is one
row for each patient / drug combination
3. select the relevant drugs in sheet 2 with filter()
4. join long sheet 1 and filtered sheet2 with inner_join()
5. summarise() the drug codes and names after group_by(patient_id)
6. count() the number of drug codes.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2017-04-24 7:07 GMT+02:00 abo dalash <abo_dlsh at hotmail.com>:
> Hi there
>
> I have data set with 500,000 patients (rows) and the first column is
> Patient I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug 20
so
> each row starts with the patient ID and the remaining of cells in the row
> are codes for names of the treatments taken by the patient. Number of
> treatments differ between patients. For example, there are patients with 3
> treatments only and patients with 20 drugs. The unique number of treatments
> in the entire data set is about 6700 drugs. However, I'm interested in
> studying only 128 drugs, these drugs are listed in a second sheet as code
> numbers associated with their meanings (names of drugs representing the
> code). I'm interested in identifying the most frequently used DRUG
> COMBINATIONS between only the 128 drugs among the 6700 drugs. The structure
> of the Excell file to be used in analysis is like this:
>
> -Sheet 1( the entire data set):-
> 1   Patient ID    Drug1    Drug2 ....    Drug 20.
> 2  1125            45            46                55
> 3  1126              60           55                 45
> .
> .
> 500,000
>
> -Sheet 2 (list of codes meanings for only the drugs of interest):
>
> 1    Drug code             meaning
> 2    45                            Simvastatin
> 3    55                            Aspirin
> 4    60                            Paracetamol
> .
> 128
>
> The desired output I'm looking for :
>
>  Drug codes      Meaning          Frequency
> 45+55               Simvastatin              2
>                          +Aspirin
> 60+55             Aspirin+                       1
>                         Paracetamol
> 60+45             Simvastatin+                1
>                         Paracetamol
>
> Please note the the final output does not include any combination
> containing drug 46 as this is not in the list of drugs preferred to be
> studied which are mentioned in sheet 2.
>
> Could you please help me which R codes and packages should be used to run
> this analyisis?
>
> Regards
> Mustafa
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Boris Steipe

2017-Apr-24 16:47 UTC

head link

[R] Frequency of Combinations

This can be easily done in base R.

The solution below is pedestrian, transparent and explicit. Thus it's easy
to
debug and validate(!) each step.

Prepare:
(1) Save your Excel spreadsheet as a text file with tab-separated values.
(2) Initialize a 128 * 128 matrix to hold your results. It should have
    your drug-codes as row names and as column names. 
(3) Create a vector of drug codes as character(!) (not as integers). Treating
them
    as strings avoids being bitten by whatever unexpected stuff you'll find
in your data.

Compile:
(4) read your data line by line. For each line:
       use strsplit() to get a vector of contents of your spreadsheet cells,
         drop the patient ID while you are doing this (IDs could overlap drug
codes).
       use match() to compare your drug codes with the vector of contents.
match()
         returns a vector of positions. 
       if the length of this vector is > 1:
         use a nested loop to consider all combinations of matched drugs. For
each
         combination of drug codes, increment the corresponding value in your
matrix.
            
(5) Now all counts are in the matrix, but the same combination may appear
    in different order: add the value of each element below the diagonal
    of your matrix to the corresponding value above the diagonal.
    (The values _in_ the diagonal would correspond to drugs that have been
     entered more than once.)

(6) Finally, replace the matrix row- and column names with the actual drug
names.

Analyze:
There are many ways. Myself, I would create a data frame with the counts in one
column,
and the two drug names in the second and third column, then order() the rows
by count.

All in all, that's about 20 lines of code to prepare and compile, 10 more
for the
result. Post again if this wasn't clear, or if you need help with the actual
syntax.


Cheers,
B.    

> On Apr 24, 2017, at 1:07 AM, abo dalash <abo_dlsh at hotmail.com>
wrote:
> 
> Hi there
> 
> I have data set with 500,000 patients (rows) and the first column is
Patient I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug 20 so
each row starts with the patient ID and the remaining of cells in the row are
codes for names of the treatments taken by the patient. Number of treatments
differ between patients. For example, there are patients with 3 treatments only
and patients with 20 drugs. The unique number of treatments in the entire data
set is about 6700 drugs. However, I'm interested in studying only 128 drugs,
these drugs are listed in a second sheet as code numbers associated with their
meanings (names of drugs representing the code). I'm interested in
identifying the most frequently used DRUG COMBINATIONS between only the 128
drugs among the 6700 drugs. The structure of the Excell file to be used in
analysis is like this:
> 
> -Sheet 1( the entire data set):-
> 1   Patient ID    Drug1    Drug2 ....    Drug 20.
> 2  1125            45            46                55
> 3  1126              60           55                 45
> .
> .
> 500,000
> 
> -Sheet 2 (list of codes meanings for only the drugs of interest):
> 
> 1    Drug code             meaning
> 2    45                            Simvastatin
> 3    55                            Aspirin
> 4    60                            Paracetamol
> .
> 128
> 
> The desired output I'm looking for :
> 
> Drug codes      Meaning          Frequency
> 45+55               Simvastatin              2
>                         +Aspirin
> 60+55             Aspirin+                       1
>                        Paracetamol
> 60+45             Simvastatin+                1
>                        Paracetamol
> 
> Please note the the final output does not include any combination
containing drug 46 as this is not in the list of drugs preferred to be studied
which are mentioned in sheet 2.
> 
> Could you please help me which R codes and packages should be used to run
this analyisis?
> 
> Regards
> Mustafa
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Thierry Onkelinx

2017-Apr-25 11:22 UTC

head link

[R] Frequency of Combinations

Dear Mustafa,

Please keep the mailing list in cc.

Since you claim to have written the code, you can share the code so we can
review it. That makes more sense than having us to write code for you...

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2017-04-24 19:24 GMT+02:00 abo dalash <abo_dlsh at hotmail.com>:
> Dear Thierry
>
> Many thanks for your cooperation. I'm trying to apply the steps you
have
> mentioned. If you don't mind, could you please type the entire codes
after
> these functions so I can make sure that I have done everything correctly. I
> mean the details inside the ().
>
> Many thanks
>  Regards
>
>
>
> Sent from my Samsung device
>
>
> -------- Original message --------
> From: Thierry Onkelinx <thierry.onkelinx at inbo.be>
> Date: 24/04/2017 3:43 p.m. (GMT+00:00)
> To: abo dalash <abo_dlsh at hotmail.com>
> Cc: "r-help at R-project.org" <r-help at r-project.org>
> Subject: Re: [R] Frequency of Combinations
>
> Dear Mustafa,
>
> I'd recommend the packages readxls to import the data, tidyr to
transform
> the data into long format and dplyr to select the data.
>
> 1. read the data into R with read_excel()
> 2. transform sheet 1 into a long format with gather(). The result is one
> row for each patient / drug combination
> 3. select the relevant drugs in sheet 2 with filter()
> 4. join long sheet 1 and filtered sheet2 with inner_join()
> 5. summarise() the drug codes and names after group_by(patient_id)
> 6. count() the number of drug codes.
>
> Best regards,
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality
Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2017-04-24 7:07 GMT+02:00 abo dalash <abo_dlsh at hotmail.com>:
>
>> Hi there
>>
>> I have data set with 500,000 patients (rows) and the first column is
>> Patient I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug
20 so
>> each row starts with the patient ID and the remaining of cells in the
row
>> are codes for names of the treatments taken by the patient. Number of
>> treatments differ between patients. For example, there are patients
with 3
>> treatments only and patients with 20 drugs. The unique number of
treatments
>> in the entire data set is about 6700 drugs. However, I'm interested
in
>> studying only 128 drugs, these drugs are listed in a second sheet as
code
>> numbers associated with their meanings (names of drugs representing the
>> code). I'm interested in identifying the most frequently used DRUG
>> COMBINATIONS between only the 128 drugs among the 6700 drugs. The
structure
>> of the Excell file to be used in analysis is like this:
>>
>> -Sheet 1( the entire data set):-
>> 1   Patient ID    Drug1    Drug2 ....    Drug 20.
>> 2  1125            45            46                55
>> 3  1126              60           55                 45
>> .
>> .
>> 500,000
>>
>> -Sheet 2 (list of codes meanings for only the drugs of interest):
>>
>> 1    Drug code             meaning
>> 2    45                            Simvastatin
>> 3    55                            Aspirin
>> 4    60                            Paracetamol
>> .
>> 128
>>
>> The desired output I'm looking for :
>>
>>  Drug codes      Meaning          Frequency
>> 45+55               Simvastatin              2
>>                          +Aspirin
>> 60+55             Aspirin+                       1
>>                         Paracetamol
>> 60+45             Simvastatin+                1
>>                         Paracetamol
>>
>> Please note the the final output does not include any combination
>> containing drug 46 as this is not in the list of drugs preferred to be
>> studied which are mentioned in sheet 2.
>>
>> Could you please help me which R codes and packages should be used to
run
>> this analyisis?
>>
>> Regards
>> Mustafa
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posti
>> ng-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
	[[alternative HTML version deleted]]

R help - Apr 2017 - Frequency of Combinations

[R] Frequency of Combinations

[R] Frequency of Combinations

[R] Frequency of Combinations

[R] Frequency of Combinations