thr3ads.net - R help - [R] sqldf not joining all the fields [Mar 2010]

If this information is useful, please help other people find it:
Share via:

Newbie19_02

2010-Mar-12 16:56 UTC

[R] sqldf not joining all the fields

Dear R users,

I have two data frames that were read from text files as follows:

x_data <- read.table("x.txt", header = TRUE, sep = "|",
quote = "\"'",
                dec = ".",as.is = TRUE,na.strings =
"NA",colClasses = NA,
nrows = 3864284,
                skip = 0, check.names = TRUE,fill=TRUE,
                strip.white = TRUE, blank.lines.skip = TRUE,
                comment.char = "#", allowEscapes = FALSE, flush =
FALSE,
                fileEncoding = "", encoding = "unknown")

x_data

prochi prescribed_date dataMonth item_code res_seqno quantity directions
CAO0000713      22/06/2001      NULL    842752      NULL       60        1/D
CAO0000713      28/04/2000      NULL      7800      NULL     100G       A/TD
CAO0000713      10/04/2000      NULL    842652      NULL       60        1/D
CAO0000713      03/07/2000      NULL    842652      NULL       60        1/D
CAO0000713      09/01/2001      NULL    842752      NULL       60        1/D
CAO0000713      16/10/2001      NULL    842752      NULL       60        1/D
CAO0000713      16/08/2001      NULL    842752      NULL       60        1/D
CAO0000713      17/09/1993      NULL     39620      NULL      5ML        NIL
CAO0000713      01/05/2001      NULL    842752      NULL       60        1/D
CAO0000713      05/03/2001      NULL    842752      NULL       60        1/D



y_data

 item_code    name                              formulation_code  strength
bnf_code
100              NEONACLEX K                    TABS      NULL    2.2.8
110                NEONACLEX                     TABS       5MG    2.2.1
50                   MESORB                         DRESS 10CMX10CM   20.3.1
160 ABSORBENT CELLULOSE MESO            DRESS 10CMX10CM   20.3.1
161 ABSORBENT CELLULOSE MESO            DRESS 10CMX15CM   20.3.1
164 ABSORBENT CELLULOSE MESO            DRESS 20CMX25CM   20.3.1
200                  SEPTRIN                        TABS     480MG    5.1.8
210          SEPTRIN PAED SF                    SUSP 240MG/5ML    5.1.8
212            SEPTRIN ADULT                     SUSP 480MG/5ML    5.1.8
220            SEPTRIN FORTE                     TABS     960MG    5.1.8
 etc....


contains all the information for the item codes
y was read in in the same way.

I then used the following code:

z  <- sqldf("select * from x left join y using (code)")

when I use this on my real data I get an output:
 prochi prescribed_date dataMonth item_code res_seqno quantity directions
1  CAO0000713      22/06/2001      NULL    842752      NULL       60       
1/D
2  CAO0000713      28/04/2000      NULL      7800      NULL     100G      
A/TD
3  CAO0000713      10/04/2000      NULL    842652      NULL       60       
1/D
4  CAO0000713      03/07/2000      NULL    842652      NULL       60       
1/D
5  CAO0000713      09/01/2001      NULL    842752      NULL       60       
1/D
6  CAO0000713      16/10/2001      NULL    842752      NULL       60       
1/D
7  CAO0000713      16/08/2001      NULL    842752      NULL       60       
1/D
8  CAO0000713      17/09/1993      NULL     39620      NULL      5ML       
NIL
9  CAO0000713      01/05/2001      NULL    842752      NULL       60       
1/D
10 CAO0000713      05/03/2001      NULL    842752      NULL       60       
1/D
   no_of_packs datasource scan_ref_no         name formulation_code strength
1         NULL        TSF        NULL         <NA>             <NA> 
<NA>
2         NULL        TSF        NULL BETNOVATE RD             OINT   0.025%
3         NULL        TSF        NULL         <NA>             <NA> 
<NA>
4         NULL        TSF        NULL         <NA>             <NA> 
<NA>
5         NULL        TSF        NULL         <NA>             <NA> 
<NA>
6         NULL        TSF        NULL         <NA>             <NA> 
<NA>
7         NULL        TSF        NULL         <NA>             <NA> 
<NA>
8         NULL        TSF        NULL   GAMMABULIN              INJ    320MG
9         NULL        TSF        NULL         <NA>             <NA> 
<NA>
10        NULL        TSF        NULL         <NA>             <NA> 
<NA>
   bnf_code
1      <NA>
2  13.4.1.2
3      <NA>
4      <NA>
5      <NA>
6      <NA>
7      <NA>
8      14.5
9      <NA>
10     <NA>


There is absolutely no reason for there to be <NA> anywhere as the
information for both the tables is complete.

Not sure what the problem is?

Thanks,
Natalie
-- 
View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590786.html
Sent from the R help mailing list archive at Nabble.com.

Gabor Grothendieck

2010-Mar-12 17:01 UTC

head link

[R] sqldf not joining all the fields

Can you show the output of dput(x_data) and dput(y_data).

On Fri, Mar 12, 2010 at 11:56 AM, Newbie19_02 <nvanzuydam at gmail.com>
wrote:>
> Dear R users,
>
> I have two data frames that were read from text files as follows:
>
> x_data <- read.table("x.txt", header = TRUE, sep =
"|", quote = "\"'",
> ? ? ? ? ? ? ? ?dec = ".",as.is = TRUE,na.strings =
"NA",colClasses = NA,
> nrows = 3864284,
> ? ? ? ? ? ? ? ?skip = 0, check.names = TRUE,fill=TRUE,
> ? ? ? ? ? ? ? ?strip.white = TRUE, blank.lines.skip = TRUE,
> ? ? ? ? ? ? ? ?comment.char = "#", allowEscapes = FALSE, flush =
FALSE,
> ? ? ? ? ? ? ? ?fileEncoding = "", encoding = "unknown")
>
> x_data
>
> prochi prescribed_date dataMonth item_code res_seqno quantity directions
> CAO0000713 ? ? ?22/06/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?28/04/2000 ? ? ?NULL ? ? ?7800 ? ? ?NULL ? ? 100G ? ? ?
A/TD
> CAO0000713 ? ? ?10/04/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?03/07/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?09/01/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?16/10/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?16/08/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?17/09/1993 ? ? ?NULL ? ? 39620 ? ? ?NULL ? ? ?5ML ? ? ?
?NIL
> CAO0000713 ? ? ?01/05/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
> CAO0000713 ? ? ?05/03/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ?
?1/D
>
>
>
> y_data
>
> ?item_code ? ?name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?formulation_code ?strength
> bnf_code
> 100 ? ? ? ? ? ? ?NEONACLEX K ? ? ? ? ? ? ? ? ? ?TABS ? ? ?NULL ? ?2.2.8
> 110 ? ? ? ? ? ? ? ?NEONACLEX ? ? ? ? ? ? ? ? ? ? TABS ? ? ? 5MG ? ?2.2.1
> 50 ? ? ? ? ? ? ? ? ? MESORB ? ? ? ? ? ? ? ? ? ? ? ? DRESS 10CMX10CM ?
20.3.1
> 160 ABSORBENT CELLULOSE MESO ? ? ? ? ? ?DRESS 10CMX10CM ? 20.3.1
> 161 ABSORBENT CELLULOSE MESO ? ? ? ? ? ?DRESS 10CMX15CM ? 20.3.1
> 164 ABSORBENT CELLULOSE MESO ? ? ? ? ? ?DRESS 20CMX25CM ? 20.3.1
> 200 ? ? ? ? ? ? ? ? ?SEPTRIN ? ? ? ? ? ? ? ? ? ? ? ?TABS ? ? 480MG ? ?5.1.8
> 210 ? ? ? ? ?SEPTRIN PAED SF ? ? ? ? ? ? ? ? ? ?SUSP 240MG/5ML ? ?5.1.8
> 212 ? ? ? ? ? ?SEPTRIN ADULT ? ? ? ? ? ? ? ? ? ? SUSP 480MG/5ML ? ?5.1.8
> 220 ? ? ? ? ? ?SEPTRIN FORTE ? ? ? ? ? ? ? ? ? ? TABS ? ? 960MG ? ?5.1.8
> ?etc....
>
>
> contains all the information for the item codes
> y was read in in the same way.
>
> I then used the following code:
>
> z ?<- sqldf("select * from x left join y using (code)")
>
> when I use this on my real data I get an output:
> ?prochi prescribed_date dataMonth item_code res_seqno quantity directions
> 1 ?CAO0000713 ? ? ?22/06/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60
> 1/D
> 2 ?CAO0000713 ? ? ?28/04/2000 ? ? ?NULL ? ? ?7800 ? ? ?NULL ? ? 100G
> A/TD
> 3 ?CAO0000713 ? ? ?10/04/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60
> 1/D
> 4 ?CAO0000713 ? ? ?03/07/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60
> 1/D
> 5 ?CAO0000713 ? ? ?09/01/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60
> 1/D
> 6 ?CAO0000713 ? ? ?16/10/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60
> 1/D
> 7 ?CAO0000713 ? ? ?16/08/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60
> 1/D
> 8 ?CAO0000713 ? ? ?17/09/1993 ? ? ?NULL ? ? 39620 ? ? ?NULL ? ? ?5ML
> NIL
> 9 ?CAO0000713 ? ? ?01/05/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60
> 1/D
> 10 CAO0000713 ? ? ?05/03/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60
> 1/D
> ? no_of_packs datasource scan_ref_no ? ? ? ? name formulation_code strength
> 1 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 2 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL BETNOVATE RD ? ? ? ? ? ? OINT ?
0.025%
> 3 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 4 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 5 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 6 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 7 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 8 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? GAMMABULIN ? ? ? ? ? ? ?INJ ?
?320MG
> 9 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> 10 ? ? ? ?NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ?
<NA> ? ? <NA>
> ? bnf_code
> 1 ? ? ?<NA>
> 2 ?13.4.1.2
> 3 ? ? ?<NA>
> 4 ? ? ?<NA>
> 5 ? ? ?<NA>
> 6 ? ? ?<NA>
> 7 ? ? ?<NA>
> 8 ? ? ?14.5
> 9 ? ? ?<NA>
> 10 ? ? <NA>
>
>
> There is absolutely no reason for there to be <NA> anywhere as the
> information for both the tables is complete.
>
> Not sure what the problem is?
>
> Thanks,
> Natalie
> --
> View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590786.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Newbie19_02

2010-Mar-12 17:06 UTC

head link

[R] sqldf not joining all the fields

http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt
feb09_267_presc_items_tsf.txt 

is the total file for y so if I use the command line with the total data for
y then I get the output specified in z

Thanks,
Natalie
-- 
View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590804.html
Sent from the R help mailing list archive at Nabble.com.

Newbie19_02

2010-Mar-12 17:14 UTC

head link

[R] sqldf not joining all the fields

dput(x_data)

structure(list(prochi = c("CAO0000713", "CAO0000713",
"CAO0000713",
"CAO0000713", "CAO0000713", "CAO0000713",
"CAO0000713", "CAO0000713",
"CAO0000713", "CAO0000713"), prescribed_date =
c("22/06/2001",
"28/04/2000", "10/04/2000", "03/07/2000",
"09/01/2001", "16/10/2001",
"16/08/2001", "17/09/1993", "01/05/2001",
"05/03/2001"), dataMonth c("NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL"), item_code = c("842752", "7800",
"842652", "842652",
"842752", "842752", "842752", "39620",
"842752", "842752"), res_seqno c("NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL"), quantity = c("60", "100G",
"60", "60", "60", "60", "60",
"5ML", "60", "60"), directions =
c("1/D", "A/TD", "1/D", "1/D",
"1/D", "1/D", "1/D", "NIL",
"1/D", "1/D"), no_of_packs = c("NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL"), datasource = c("TSF", "TSF",
"TSF", "TSF", "TSF", "TSF",
"TSF", "TSF", "TSF", "TSF"), scan_ref_no
= c("NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL"
)), .Names = c("prochi", "prescribed_date",
"dataMonth", "item_code",
"res_seqno", "quantity", "directions",
"no_of_packs", "datasource",
"scan_ref_no"), row.names = c(NA, 10L), class =
"data.frame")

-- 
View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590821.html
Sent from the R help mailing list archive at Nabble.com.

Gabor Grothendieck

2010-Mar-12 17:15 UTC

head link

[R] sqldf not joining all the fields

Please provide code that I can just copy from your post and paste into
my session.   Either provide dput output as requested or provide the
files on the internet together with code that reads them off the
internet.

On Fri, Mar 12, 2010 at 12:06 PM, Newbie19_02 <nvanzuydam at gmail.com>
wrote:>
> http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt
> feb09_267_presc_items_tsf.txt
>
> is the total file for y so if I use the command line with the total data
for
> y then I get the output specified in z
>
> Thanks,
> Natalie
> --
> View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590804.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Gabor Grothendieck

2010-Mar-12 17:16 UTC

head link

[R] sqldf not joining all the fields

What  about y_data?

On Fri, Mar 12, 2010 at 12:14 PM, Newbie19_02 <nvanzuydam at gmail.com>
wrote:>
> dput(x_data)
>
> structure(list(prochi = c("CAO0000713", "CAO0000713",
"CAO0000713",
> "CAO0000713", "CAO0000713", "CAO0000713",
"CAO0000713", "CAO0000713",
> "CAO0000713", "CAO0000713"), prescribed_date =
c("22/06/2001",
> "28/04/2000", "10/04/2000", "03/07/2000",
"09/01/2001", "16/10/2001",
> "16/08/2001", "17/09/1993", "01/05/2001",
"05/03/2001"), dataMonth > c("NULL",
> "NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
> "NULL"), item_code = c("842752", "7800",
"842652", "842652",
> "842752", "842752", "842752",
"39620", "842752", "842752"), res_seqno >
c("NULL",
> "NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
> "NULL"), quantity = c("60", "100G",
"60", "60", "60", "60", "60",
> "5ML", "60", "60"), directions =
c("1/D", "A/TD", "1/D", "1/D",
> "1/D", "1/D", "1/D", "NIL",
"1/D", "1/D"), no_of_packs = c("NULL",
> "NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL",
> "NULL"), datasource = c("TSF", "TSF",
"TSF", "TSF", "TSF", "TSF",
> "TSF", "TSF", "TSF", "TSF"),
scan_ref_no = c("NULL", "NULL",
> "NULL", "NULL", "NULL", "NULL",
"NULL", "NULL", "NULL", "NULL"
> )), .Names = c("prochi", "prescribed_date",
"dataMonth", "item_code",
> "res_seqno", "quantity", "directions",
"no_of_packs", "datasource",
> "scan_ref_no"), row.names = c(NA, 10L), class =
"data.frame")
>
> --
> View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590821.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Natalie Van Zuydam

2010-Mar-12 17:51 UTC

head link

[R] Fwd: sqldf not joining all the fields

---------- Forwarded message ----------
From: Natalie Van Zuydam <nvanzuydam@gmail.com>
Date: Fri, Mar 12, 2010 at 5:49 PM
Subject: Re: [R] sqldf not joining all the fields
To: David Winsemius <dwinsemius@comcast.net>


Dear David

I'm not sure what the problem is as for every item code there is a
corresponding information in the y_data.  For example 842752 from the x_data
corresponds to Aspirin in the y_data?  Yet when I use sqldf to join the two
df's I get NA values in the columns from the y_data in z for 842752 item
code....is there something wrong with my sqldf code or something wrong with
the way I have inputed the data frames?

Thanks for taking the time to help me,
Natalie



On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius
<dwinsemius@comcast.net>wrote:
> If I assign the file input to y_data and change you sqldf to
>
> > z  <- sqldf("select * from x_data left join y_data using
(item_code)"); z
>
> I can replicate your result. Even after changing the types of the two
> item_code fields to match I still get the same result and when I see to
what
> degree they share values I get:
>
> > sum(x_data$item_code %in% y_data$item_code)
> [1] 2
> > sum(y_data$item_code %in% x_data$item_code)
> [1] 2
>
>
> So why are you so sure they are "complete" as you claimed in your
first
> email.
>
> --
> David.
>
>
>
> On Mar 12, 2010, at 12:29 PM, David Winsemius wrote:
>
>  You have now given two different assignments to x_data and none to y_data:
>>
>> The str( from the file access offering:
>>
>> > str(x_data)
>> 'data.frame':   2848 obs. of  5 variables:
>> $ item_code       : int  100 110 150 160 161 164 200 210 212 220 ...
>> $ name            : chr  "NEONACLEX K" "NEONACLEX"
"MESORB" "ABSORBENT
>> CELLULOSE MESO" ...
>> $ formulation_code: chr  "TABS" "TABS"
"DRESS" "DRESS" ...
>> $ strength        : chr  "NULL" "5MG"
"10CMX10CM" "10CMX10CM" ...
>> $ bnf_code        : chr  "2.2.8" "2.2.1"
"20.3.1" "20.3.1" ...
>>
>> The str from assignment from the dput offering
>> > str(x_data)
>> 'data.frame':   10 obs. of  10 variables:
>> $ prochi         : chr  "CAO0000713" "CAO0000713"
"CAO0000713"
>> "CAO0000713" ...
>> $ prescribed_date: chr  "22/06/2001" "28/04/2000"
"10/04/2000"
>> "03/07/2000" ...
>> $ dataMonth      : chr  "NULL" "NULL"
"NULL" "NULL" ...
>> $ item_code      : chr  "842752" "7800"
"842652" "842652" ...
>> $ res_seqno      : chr  "NULL" "NULL"
"NULL" "NULL" ...
>> $ quantity       : chr  "60" "100G" "60"
"60" ...
>> $ directions     : chr  "1/D" "A/TD"
"1/D" "1/D" ...
>> $ no_of_packs    : chr  "NULL" "NULL"
"NULL" "NULL" ...
>> $ datasource     : chr  "TSF" "TSF" "TSF"
"TSF" ...
>> $ scan_ref_no    : chr  "NULL" "NULL"
"NULL" "NULL" ...
>>
>> This code "worked", but it is not clear that the x-y
assignments were
>> correct:
>>
>> x_data <- read.table(file="
>> http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt",
header
>> = TRUE, sep = "|", quote = "\"'",
>>              dec = ".",as.is = TRUE,na.strings =
"NA",colClasses = NA,
>> nrows = 3864284,
>>              skip = 0, check.names = TRUE,fill=TRUE,
>>              strip.white = TRUE, blank.lines.skip = TRUE,
>>              comment.char = "#", allowEscapes = FALSE, flush
= FALSE,
>>              fileEncoding = "", encoding =
"unknown")
>>
>> --
>> David.
>>
>> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote:
>>
>>
>>> The y_data file has over 9000 rows in it so I thought it would be
more
>>> practical to give you the file to download....
>>> --
>>> View this message in context:
>>>
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>
	[[alternative HTML version deleted]]

Natalie Van Zuydam

2010-Mar-12 19:15 UTC

head link

[R] Fwd: sqldf not joining all the fields

Hi,
I did quote the string and as I told read.table to strip.white  I also tried
it with no spaces.
 "ASPIRIN DISP AAH" %in% tsf_data$name
[1] FALSE

"ASPIRINDISPAAH" %in% tsf_data$name
[1] FALSE

I have looked at the last lines of my y_data object and there is a problem
with the file that I'm going to try to use perl to sort out


if I y_data[2847:2848,]
RONIC ACID|TABS|5MG|6.6.2\n652903|ALENDRONIC ACID Once
Wee|TABS|70MG|1.1.1\n653000|AZATHIOPRINE|INJ|50MG|8.2.1\n653200|DORZOLAMIDE
EYE|DROPS|2%|11.6\n653500|TOPIRAMATE|TABS|50MG|4.8.1\n653510|TOPIRAMATE|TABS|100MG|4.8.1\n653513|TOPIRAMATE|TABS|25MG|4.8.1\n653600|GUAR
GUM
SF|SACH|5G|6.1.2.3\n653700|TACALCITOL|OINT|NULL|13.5.2\n654000|COCODAMOL|CAPS|30/500MG|4.7.1\n654010|COCODAMOL|TABS|12.8/500M|4.7.1\n654020|COCODAMOL|SACH|30/500MG|4.7.1\n654300|ACAMPROSATE
CALCIUM|TABS|333MG|4.10\n654400|ACECLOFENAC|TABS|100MG|10.1.1\n654500|TILUDRONIC
ACID|TABS|200MG|6.6.2\n654600|TAMSULOSIN HCL
MR|CAPS|400MCG|7.4.1\n654800|PENCICLOVIR|CREAM|1%|13.10.3\......
     ........formulation_code strength bnf_code
2847              INJ   1MG/ML    3.4.3
2848

The output looks like this so it must be the input files and not an sqldf
problem.  I didn't see this in the script editor I was using.

Natalie

On Fri, Mar 12, 2010 at 7:05 PM, Dennis Murphy <djmuser@gmail.com> wrote:
> Hi:
>
> Part of the problem is that your string contains spaces; in your example
> below, you would need to quote the string, I believe, as in
>
> "ASPIRIN DISP AAH" %in% y$name                       
[untested...]
>
> HTH,
> Dennis
>
>
> On Fri, Mar 12, 2010 at 10:59 AM, Natalie Van Zuydam
<nvanzuydam@gmail.com
> > wrote:
>
>> 842752|ASPIRIN DISP AAH|TABS|75MG|2.9  is taken directly from my y_data
>> text
>> file.
>>
>> If I search for ASPIRIN DISP AAH %in% y$name  I get: FALSE.  Despite
the
>> fact that it is there in the text file that I loaded into y. There must
be
>> a
>> problem with my input.
>>
>> tsf_data <- read.table("feb09_267_presc_items_tsf.txt",
header = TRUE, sep
>> >> "|", quote = "\"'",
>>                dec = ".",as.is = TRUE,na.strings =
"NA",colClasses = NA,
>> nrows = 3864284,
>>                skip = 0, check.names = TRUE,fill=TRUE,
>>                strip.white = TRUE, blank.lines.skip = TRUE,
>>                comment.char = "#", allowEscapes = FALSE,
flush = FALSE,
>>                fileEncoding = "", encoding =
"unknown")
>>
>> Would fill=TRUE and strip.white=TRUE affect how the item_code columns
are
>> matched?  When I look at the file in a script editor I cannot see any
>> strange symbols or formatting?  I have had to use fill otherwise I get
the
>> following error message:
>>
>> Warning message:
>> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>  number of items read is not a multiple of the number of columns
>>
>> Not sure if correcting this in read.table is affecting the join
function
>> of
>> sqldb?
>>
>> Thanks,
>> Natalie
>>
>>
>>
>> I also tried 842752
>>
>> On Fri, Mar 12, 2010 at 6:05 PM, Gabor Grothendieck <
>> ggrothendieck@gmail.com
>> > wrote:
>>
>> > That is not so.  842752 does not exist in y$item_code and ASPIRIN
has
>> > a code of 22730.
>> >
>> > > 842752 %in% y$item_code
>> > [1] FALSE
>> >
>> > > subset(y, name == "ASPIRIN")
>> >    item_code    name formulation_code strength bnf_code
>> > 850     22730 ASPIRIN             TABS    300MG    4.7.1
>> > 855     22780 ASPIRIN              PDR     NULL    4.7.1
>> > 856     22790 ASPIRIN             MIXT        $    4.7.1
>> >
>> >
>> > On Fri, Mar 12, 2010 at 12:51 PM, Natalie Van Zuydam
>> > <nvanzuydam@gmail.com> wrote:
>> > > ---------- Forwarded message ----------
>> > > From: Natalie Van Zuydam <nvanzuydam@gmail.com>
>> > > Date: Fri, Mar 12, 2010 at 5:49 PM
>> > > Subject: Re: [R] sqldf not joining all the fields
>> > > To: David Winsemius <dwinsemius@comcast.net>
>> > >
>> > >
>> > > Dear David
>> > >
>> > > I'm not sure what the problem is as for every item code
there is a
>> > > corresponding information in the y_data.  For example 842752
from the
>> > x_data
>> > > corresponds to Aspirin in the y_data?  Yet when I use sqldf
to join
>> the
>> > two
>> > > df's I get NA values in the columns from the y_data in z
for 842752
>> item
>> > > code....is there something wrong with my sqldf code or
something wrong
>> > with
>> > > the way I have inputed the data frames?
>> > >
>> > > Thanks for taking the time to help me,
>> > > Natalie
>> > >
>> > >
>> > >
>> > > On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius <
>> dwinsemius@comcast.net
>> > >wrote:
>> > >
>> > >> If I assign the file input to y_data and change you sqldf
to
>> > >>
>> > >> > z  <- sqldf("select * from x_data left join
y_data using
>> > (item_code)"); z
>> > >>
>> > >> I can replicate your result. Even after changing the
types of the two
>> > >> item_code fields to match I still get the same result and
when I see
>> to
>> > what
>> > >> degree they share values I get:
>> > >>
>> > >> > sum(x_data$item_code %in% y_data$item_code)
>> > >> [1] 2
>> > >> > sum(y_data$item_code %in% x_data$item_code)
>> > >> [1] 2
>> > >>
>> > >>
>> > >> So why are you so sure they are "complete" as
you claimed in your
>> first
>> > >> email.
>> > >>
>> > >> --
>> > >> David.
>> > >>
>> > >>
>> > >>
>> > >> On Mar 12, 2010, at 12:29 PM, David Winsemius wrote:
>> > >>
>> > >>  You have now given two different assignments to x_data
and none to
>> > y_data:
>> > >>>
>> > >>> The str( from the file access offering:
>> > >>>
>> > >>> > str(x_data)
>> > >>> 'data.frame':   2848 obs. of  5 variables:
>> > >>> $ item_code       : int  100 110 150 160 161 164 200
210 212 220 ...
>> > >>> $ name            : chr  "NEONACLEX K"
"NEONACLEX" "MESORB"
>> "ABSORBENT
>> > >>> CELLULOSE MESO" ...
>> > >>> $ formulation_code: chr  "TABS"
"TABS" "DRESS" "DRESS" ...
>> > >>> $ strength        : chr  "NULL"
"5MG" "10CMX10CM" "10CMX10CM" ...
>> > >>> $ bnf_code        : chr  "2.2.8"
"2.2.1" "20.3.1" "20.3.1" ...
>> > >>>
>> > >>> The str from assignment from the dput offering
>> > >>> > str(x_data)
>> > >>> 'data.frame':   10 obs. of  10 variables:
>> > >>> $ prochi         : chr  "CAO0000713"
"CAO0000713" "CAO0000713"
>> > >>> "CAO0000713" ...
>> > >>> $ prescribed_date: chr  "22/06/2001"
"28/04/2000" "10/04/2000"
>> > >>> "03/07/2000" ...
>> > >>> $ dataMonth      : chr  "NULL"
"NULL" "NULL" "NULL" ...
>> > >>> $ item_code      : chr  "842752"
"7800" "842652" "842652" ...
>> > >>> $ res_seqno      : chr  "NULL"
"NULL" "NULL" "NULL" ...
>> > >>> $ quantity       : chr  "60"
"100G" "60" "60" ...
>> > >>> $ directions     : chr  "1/D"
"A/TD" "1/D" "1/D" ...
>> > >>> $ no_of_packs    : chr  "NULL"
"NULL" "NULL" "NULL" ...
>> > >>> $ datasource     : chr  "TSF"
"TSF" "TSF" "TSF" ...
>> > >>> $ scan_ref_no    : chr  "NULL"
"NULL" "NULL" "NULL" ...
>> > >>>
>> > >>> This code "worked", but it is not clear
that the x-y assignments
>> were
>> > >>> correct:
>> > >>>
>> > >>> x_data <- read.table(file="
>> > >>>
http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt",
>> > header
>> > >>> = TRUE, sep = "|", quote =
"\"'",
>> > >>>              dec = ".",as.is =
TRUE,na.strings = "NA",colClasses >> NA,
>> > >>> nrows = 3864284,
>> > >>>              skip = 0, check.names = TRUE,fill=TRUE,
>> > >>>              strip.white = TRUE, blank.lines.skip =
TRUE,
>> > >>>              comment.char = "#",
allowEscapes = FALSE, flush >> FALSE,
>> > >>>              fileEncoding = "", encoding =
"unknown")
>> > >>>
>> > >>> --
>> > >>> David.
>> > >>>
>> > >>> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote:
>> > >>>
>> > >>>
>> > >>>> The y_data file has over 9000 rows in it so I
thought it would be
>> more
>> > >>>> practical to give you the file to download....
>> > >>>> --
>> > >>>> View this message in context:
>> > >>>>
>> >
>>
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html
>> > >>>> Sent from the R help mailing list archive at
Nabble.com.
>> > >>>>
>> > >>>> ______________________________________________
>> > >>>> R-help@r-project.org mailing list
>> > >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > >>>> PLEASE do read the posting guide
>> > >>>> http://www.R-project.org/posting-guide.html
>> > >>>> and provide commented, minimal, self-contained,
reproducible code.
>> > >>>>
>> > >>>
>> > >>> David Winsemius, MD
>> > >>> West Hartford, CT
>> > >>>
>> > >>> ______________________________________________
>> > >>> R-help@r-project.org mailing list
>> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > >>> PLEASE do read the posting guide
>> > >>> http://www.R-project.org/posting-guide.html
>> > >>> and provide commented, minimal, self-contained,
reproducible code.
>> > >>>
>> > >>
>> > >> David Winsemius, MD
>> > >> West Hartford, CT
>> > >>
>> > >>
>> > >
>> > >        [[alternative HTML version deleted]]
>> > >
>> > > ______________________________________________
>> > > R-help@r-project.org mailing list
>> > > https://stat.ethz.ch/mailman/listinfo/r-help
>> > > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > > and provide commented, minimal, self-contained, reproducible
code.
>> > >
>> >
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
	[[alternative HTML version deleted]]

Possibly Parallel Threads

Search for more possibly parallel threads

R help - Mar 2010 - sqldf not joining all the fields

[R] sqldf not joining all the fields

[R] sqldf not joining all the fields

[R] sqldf not joining all the fields

[R] sqldf not joining all the fields

[R] sqldf not joining all the fields

[R] sqldf not joining all the fields

[R] Fwd: sqldf not joining all the fields

[R] Fwd: sqldf not joining all the fields

Possibly Parallel Threads