Dear R users, I have two data frames that were read from text files as follows: x_data <- read.table("x.txt", header = TRUE, sep = "|", quote = "\"'", dec = ".",as.is = TRUE,na.strings = "NA",colClasses = NA, nrows = 3864284, skip = 0, check.names = TRUE,fill=TRUE, strip.white = TRUE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, fileEncoding = "", encoding = "unknown") x_data prochi prescribed_date dataMonth item_code res_seqno quantity directions CAO0000713 22/06/2001 NULL 842752 NULL 60 1/D CAO0000713 28/04/2000 NULL 7800 NULL 100G A/TD CAO0000713 10/04/2000 NULL 842652 NULL 60 1/D CAO0000713 03/07/2000 NULL 842652 NULL 60 1/D CAO0000713 09/01/2001 NULL 842752 NULL 60 1/D CAO0000713 16/10/2001 NULL 842752 NULL 60 1/D CAO0000713 16/08/2001 NULL 842752 NULL 60 1/D CAO0000713 17/09/1993 NULL 39620 NULL 5ML NIL CAO0000713 01/05/2001 NULL 842752 NULL 60 1/D CAO0000713 05/03/2001 NULL 842752 NULL 60 1/D y_data item_code name formulation_code strength bnf_code 100 NEONACLEX K TABS NULL 2.2.8 110 NEONACLEX TABS 5MG 2.2.1 50 MESORB DRESS 10CMX10CM 20.3.1 160 ABSORBENT CELLULOSE MESO DRESS 10CMX10CM 20.3.1 161 ABSORBENT CELLULOSE MESO DRESS 10CMX15CM 20.3.1 164 ABSORBENT CELLULOSE MESO DRESS 20CMX25CM 20.3.1 200 SEPTRIN TABS 480MG 5.1.8 210 SEPTRIN PAED SF SUSP 240MG/5ML 5.1.8 212 SEPTRIN ADULT SUSP 480MG/5ML 5.1.8 220 SEPTRIN FORTE TABS 960MG 5.1.8 etc.... contains all the information for the item codes y was read in in the same way. I then used the following code: z <- sqldf("select * from x left join y using (code)") when I use this on my real data I get an output: prochi prescribed_date dataMonth item_code res_seqno quantity directions 1 CAO0000713 22/06/2001 NULL 842752 NULL 60 1/D 2 CAO0000713 28/04/2000 NULL 7800 NULL 100G A/TD 3 CAO0000713 10/04/2000 NULL 842652 NULL 60 1/D 4 CAO0000713 03/07/2000 NULL 842652 NULL 60 1/D 5 CAO0000713 09/01/2001 NULL 842752 NULL 60 1/D 6 CAO0000713 16/10/2001 NULL 842752 NULL 60 1/D 7 CAO0000713 16/08/2001 NULL 842752 NULL 60 1/D 8 CAO0000713 17/09/1993 NULL 39620 NULL 5ML NIL 9 CAO0000713 01/05/2001 NULL 842752 NULL 60 1/D 10 CAO0000713 05/03/2001 NULL 842752 NULL 60 1/D no_of_packs datasource scan_ref_no name formulation_code strength 1 NULL TSF NULL <NA> <NA> <NA> 2 NULL TSF NULL BETNOVATE RD OINT 0.025% 3 NULL TSF NULL <NA> <NA> <NA> 4 NULL TSF NULL <NA> <NA> <NA> 5 NULL TSF NULL <NA> <NA> <NA> 6 NULL TSF NULL <NA> <NA> <NA> 7 NULL TSF NULL <NA> <NA> <NA> 8 NULL TSF NULL GAMMABULIN INJ 320MG 9 NULL TSF NULL <NA> <NA> <NA> 10 NULL TSF NULL <NA> <NA> <NA> bnf_code 1 <NA> 2 13.4.1.2 3 <NA> 4 <NA> 5 <NA> 6 <NA> 7 <NA> 8 14.5 9 <NA> 10 <NA> There is absolutely no reason for there to be <NA> anywhere as the information for both the tables is complete. Not sure what the problem is? Thanks, Natalie -- View this message in context: http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590786.html Sent from the R help mailing list archive at Nabble.com.
Can you show the output of dput(x_data) and dput(y_data). On Fri, Mar 12, 2010 at 11:56 AM, Newbie19_02 <nvanzuydam at gmail.com> wrote:> > Dear R users, > > I have two data frames that were read from text files as follows: > > x_data <- read.table("x.txt", header = TRUE, sep = "|", quote = "\"'", > ? ? ? ? ? ? ? ?dec = ".",as.is = TRUE,na.strings = "NA",colClasses = NA, > nrows = 3864284, > ? ? ? ? ? ? ? ?skip = 0, check.names = TRUE,fill=TRUE, > ? ? ? ? ? ? ? ?strip.white = TRUE, blank.lines.skip = TRUE, > ? ? ? ? ? ? ? ?comment.char = "#", allowEscapes = FALSE, flush = FALSE, > ? ? ? ? ? ? ? ?fileEncoding = "", encoding = "unknown") > > x_data > > prochi prescribed_date dataMonth item_code res_seqno quantity directions > CAO0000713 ? ? ?22/06/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?28/04/2000 ? ? ?NULL ? ? ?7800 ? ? ?NULL ? ? 100G ? ? ? A/TD > CAO0000713 ? ? ?10/04/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?03/07/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?09/01/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?16/10/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?16/08/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?17/09/1993 ? ? ?NULL ? ? 39620 ? ? ?NULL ? ? ?5ML ? ? ? ?NIL > CAO0000713 ? ? ?01/05/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > CAO0000713 ? ? ?05/03/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 ? ? ? ?1/D > > > > y_data > > ?item_code ? ?name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?formulation_code ?strength > bnf_code > 100 ? ? ? ? ? ? ?NEONACLEX K ? ? ? ? ? ? ? ? ? ?TABS ? ? ?NULL ? ?2.2.8 > 110 ? ? ? ? ? ? ? ?NEONACLEX ? ? ? ? ? ? ? ? ? ? TABS ? ? ? 5MG ? ?2.2.1 > 50 ? ? ? ? ? ? ? ? ? MESORB ? ? ? ? ? ? ? ? ? ? ? ? DRESS 10CMX10CM ? 20.3.1 > 160 ABSORBENT CELLULOSE MESO ? ? ? ? ? ?DRESS 10CMX10CM ? 20.3.1 > 161 ABSORBENT CELLULOSE MESO ? ? ? ? ? ?DRESS 10CMX15CM ? 20.3.1 > 164 ABSORBENT CELLULOSE MESO ? ? ? ? ? ?DRESS 20CMX25CM ? 20.3.1 > 200 ? ? ? ? ? ? ? ? ?SEPTRIN ? ? ? ? ? ? ? ? ? ? ? ?TABS ? ? 480MG ? ?5.1.8 > 210 ? ? ? ? ?SEPTRIN PAED SF ? ? ? ? ? ? ? ? ? ?SUSP 240MG/5ML ? ?5.1.8 > 212 ? ? ? ? ? ?SEPTRIN ADULT ? ? ? ? ? ? ? ? ? ? SUSP 480MG/5ML ? ?5.1.8 > 220 ? ? ? ? ? ?SEPTRIN FORTE ? ? ? ? ? ? ? ? ? ? TABS ? ? 960MG ? ?5.1.8 > ?etc.... > > > contains all the information for the item codes > y was read in in the same way. > > I then used the following code: > > z ?<- sqldf("select * from x left join y using (code)") > > when I use this on my real data I get an output: > ?prochi prescribed_date dataMonth item_code res_seqno quantity directions > 1 ?CAO0000713 ? ? ?22/06/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 > 1/D > 2 ?CAO0000713 ? ? ?28/04/2000 ? ? ?NULL ? ? ?7800 ? ? ?NULL ? ? 100G > A/TD > 3 ?CAO0000713 ? ? ?10/04/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60 > 1/D > 4 ?CAO0000713 ? ? ?03/07/2000 ? ? ?NULL ? ?842652 ? ? ?NULL ? ? ? 60 > 1/D > 5 ?CAO0000713 ? ? ?09/01/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 > 1/D > 6 ?CAO0000713 ? ? ?16/10/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 > 1/D > 7 ?CAO0000713 ? ? ?16/08/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 > 1/D > 8 ?CAO0000713 ? ? ?17/09/1993 ? ? ?NULL ? ? 39620 ? ? ?NULL ? ? ?5ML > NIL > 9 ?CAO0000713 ? ? ?01/05/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 > 1/D > 10 CAO0000713 ? ? ?05/03/2001 ? ? ?NULL ? ?842752 ? ? ?NULL ? ? ? 60 > 1/D > ? no_of_packs datasource scan_ref_no ? ? ? ? name formulation_code strength > 1 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 2 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL BETNOVATE RD ? ? ? ? ? ? OINT ? 0.025% > 3 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 4 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 5 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 6 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 7 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 8 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? GAMMABULIN ? ? ? ? ? ? ?INJ ? ?320MG > 9 ? ? ? ? NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > 10 ? ? ? ?NULL ? ? ? ?TSF ? ? ? ?NULL ? ? ? ? <NA> ? ? ? ? ? ? <NA> ? ? <NA> > ? bnf_code > 1 ? ? ?<NA> > 2 ?13.4.1.2 > 3 ? ? ?<NA> > 4 ? ? ?<NA> > 5 ? ? ?<NA> > 6 ? ? ?<NA> > 7 ? ? ?<NA> > 8 ? ? ?14.5 > 9 ? ? ?<NA> > 10 ? ? <NA> > > > There is absolutely no reason for there to be <NA> anywhere as the > information for both the tables is complete. > > Not sure what the problem is? > > Thanks, > Natalie > -- > View this message in context: http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590786.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt feb09_267_presc_items_tsf.txt is the total file for y so if I use the command line with the total data for y then I get the output specified in z Thanks, Natalie -- View this message in context: http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590804.html Sent from the R help mailing list archive at Nabble.com.
dput(x_data) structure(list(prochi = c("CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713"), prescribed_date = c("22/06/2001", "28/04/2000", "10/04/2000", "03/07/2000", "09/01/2001", "16/10/2001", "16/08/2001", "17/09/1993", "01/05/2001", "05/03/2001"), dataMonth c("NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"), item_code = c("842752", "7800", "842652", "842652", "842752", "842752", "842752", "39620", "842752", "842752"), res_seqno c("NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"), quantity = c("60", "100G", "60", "60", "60", "60", "60", "5ML", "60", "60"), directions = c("1/D", "A/TD", "1/D", "1/D", "1/D", "1/D", "1/D", "NIL", "1/D", "1/D"), no_of_packs = c("NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"), datasource = c("TSF", "TSF", "TSF", "TSF", "TSF", "TSF", "TSF", "TSF", "TSF", "TSF"), scan_ref_no = c("NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL" )), .Names = c("prochi", "prescribed_date", "dataMonth", "item_code", "res_seqno", "quantity", "directions", "no_of_packs", "datasource", "scan_ref_no"), row.names = c(NA, 10L), class = "data.frame") -- View this message in context: http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590821.html Sent from the R help mailing list archive at Nabble.com.
Please provide code that I can just copy from your post and paste into my session. Either provide dput output as requested or provide the files on the internet together with code that reads them off the internet. On Fri, Mar 12, 2010 at 12:06 PM, Newbie19_02 <nvanzuydam at gmail.com> wrote:> > http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt > feb09_267_presc_items_tsf.txt > > is the total file for y so if I use the command line with the total data for > y then I get the output specified in z > > Thanks, > Natalie > -- > View this message in context: http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590804.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
What about y_data? On Fri, Mar 12, 2010 at 12:14 PM, Newbie19_02 <nvanzuydam at gmail.com> wrote:> > dput(x_data) > > structure(list(prochi = c("CAO0000713", "CAO0000713", "CAO0000713", > "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", "CAO0000713", > "CAO0000713", "CAO0000713"), prescribed_date = c("22/06/2001", > "28/04/2000", "10/04/2000", "03/07/2000", "09/01/2001", "16/10/2001", > "16/08/2001", "17/09/1993", "01/05/2001", "05/03/2001"), dataMonth > c("NULL", > "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", > "NULL"), item_code = c("842752", "7800", "842652", "842652", > "842752", "842752", "842752", "39620", "842752", "842752"), res_seqno > c("NULL", > "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", > "NULL"), quantity = c("60", "100G", "60", "60", "60", "60", "60", > "5ML", "60", "60"), directions = c("1/D", "A/TD", "1/D", "1/D", > "1/D", "1/D", "1/D", "NIL", "1/D", "1/D"), no_of_packs = c("NULL", > "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", > "NULL"), datasource = c("TSF", "TSF", "TSF", "TSF", "TSF", "TSF", > "TSF", "TSF", "TSF", "TSF"), scan_ref_no = c("NULL", "NULL", > "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL" > )), .Names = c("prochi", "prescribed_date", "dataMonth", "item_code", > "res_seqno", "quantity", "directions", "no_of_packs", "datasource", > "scan_ref_no"), row.names = c(NA, 10L), class = "data.frame") > > -- > View this message in context: http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590821.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
---------- Forwarded message ---------- From: Natalie Van Zuydam <nvanzuydam@gmail.com> Date: Fri, Mar 12, 2010 at 5:49 PM Subject: Re: [R] sqldf not joining all the fields To: David Winsemius <dwinsemius@comcast.net> Dear David I'm not sure what the problem is as for every item code there is a corresponding information in the y_data. For example 842752 from the x_data corresponds to Aspirin in the y_data? Yet when I use sqldf to join the two df's I get NA values in the columns from the y_data in z for 842752 item code....is there something wrong with my sqldf code or something wrong with the way I have inputed the data frames? Thanks for taking the time to help me, Natalie On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius <dwinsemius@comcast.net>wrote:> If I assign the file input to y_data and change you sqldf to > > > z <- sqldf("select * from x_data left join y_data using (item_code)"); z > > I can replicate your result. Even after changing the types of the two > item_code fields to match I still get the same result and when I see to what > degree they share values I get: > > > sum(x_data$item_code %in% y_data$item_code) > [1] 2 > > sum(y_data$item_code %in% x_data$item_code) > [1] 2 > > > So why are you so sure they are "complete" as you claimed in your first > email. > > -- > David. > > > > On Mar 12, 2010, at 12:29 PM, David Winsemius wrote: > > You have now given two different assignments to x_data and none to y_data: >> >> The str( from the file access offering: >> >> > str(x_data) >> 'data.frame': 2848 obs. of 5 variables: >> $ item_code : int 100 110 150 160 161 164 200 210 212 220 ... >> $ name : chr "NEONACLEX K" "NEONACLEX" "MESORB" "ABSORBENT >> CELLULOSE MESO" ... >> $ formulation_code: chr "TABS" "TABS" "DRESS" "DRESS" ... >> $ strength : chr "NULL" "5MG" "10CMX10CM" "10CMX10CM" ... >> $ bnf_code : chr "2.2.8" "2.2.1" "20.3.1" "20.3.1" ... >> >> The str from assignment from the dput offering >> > str(x_data) >> 'data.frame': 10 obs. of 10 variables: >> $ prochi : chr "CAO0000713" "CAO0000713" "CAO0000713" >> "CAO0000713" ... >> $ prescribed_date: chr "22/06/2001" "28/04/2000" "10/04/2000" >> "03/07/2000" ... >> $ dataMonth : chr "NULL" "NULL" "NULL" "NULL" ... >> $ item_code : chr "842752" "7800" "842652" "842652" ... >> $ res_seqno : chr "NULL" "NULL" "NULL" "NULL" ... >> $ quantity : chr "60" "100G" "60" "60" ... >> $ directions : chr "1/D" "A/TD" "1/D" "1/D" ... >> $ no_of_packs : chr "NULL" "NULL" "NULL" "NULL" ... >> $ datasource : chr "TSF" "TSF" "TSF" "TSF" ... >> $ scan_ref_no : chr "NULL" "NULL" "NULL" "NULL" ... >> >> This code "worked", but it is not clear that the x-y assignments were >> correct: >> >> x_data <- read.table(file=" >> http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt", header >> = TRUE, sep = "|", quote = "\"'", >> dec = ".",as.is = TRUE,na.strings = "NA",colClasses = NA, >> nrows = 3864284, >> skip = 0, check.names = TRUE,fill=TRUE, >> strip.white = TRUE, blank.lines.skip = TRUE, >> comment.char = "#", allowEscapes = FALSE, flush = FALSE, >> fileEncoding = "", encoding = "unknown") >> >> -- >> David. >> >> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote: >> >> >>> The y_data file has over 9000 rows in it so I thought it would be more >>> practical to give you the file to download.... >>> -- >>> View this message in context: >>> http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html >>> Sent from the R help mailing list archive at Nabble.com. >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> >> David Winsemius, MD >> West Hartford, CT >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > David Winsemius, MD > West Hartford, CT > >[[alternative HTML version deleted]]
Hi, I did quote the string and as I told read.table to strip.white I also tried it with no spaces. "ASPIRIN DISP AAH" %in% tsf_data$name [1] FALSE "ASPIRINDISPAAH" %in% tsf_data$name [1] FALSE I have looked at the last lines of my y_data object and there is a problem with the file that I'm going to try to use perl to sort out if I y_data[2847:2848,] RONIC ACID|TABS|5MG|6.6.2\n652903|ALENDRONIC ACID Once Wee|TABS|70MG|1.1.1\n653000|AZATHIOPRINE|INJ|50MG|8.2.1\n653200|DORZOLAMIDE EYE|DROPS|2%|11.6\n653500|TOPIRAMATE|TABS|50MG|4.8.1\n653510|TOPIRAMATE|TABS|100MG|4.8.1\n653513|TOPIRAMATE|TABS|25MG|4.8.1\n653600|GUAR GUM SF|SACH|5G|6.1.2.3\n653700|TACALCITOL|OINT|NULL|13.5.2\n654000|COCODAMOL|CAPS|30/500MG|4.7.1\n654010|COCODAMOL|TABS|12.8/500M|4.7.1\n654020|COCODAMOL|SACH|30/500MG|4.7.1\n654300|ACAMPROSATE CALCIUM|TABS|333MG|4.10\n654400|ACECLOFENAC|TABS|100MG|10.1.1\n654500|TILUDRONIC ACID|TABS|200MG|6.6.2\n654600|TAMSULOSIN HCL MR|CAPS|400MCG|7.4.1\n654800|PENCICLOVIR|CREAM|1%|13.10.3\...... ........formulation_code strength bnf_code 2847 INJ 1MG/ML 3.4.3 2848 The output looks like this so it must be the input files and not an sqldf problem. I didn't see this in the script editor I was using. Natalie On Fri, Mar 12, 2010 at 7:05 PM, Dennis Murphy <djmuser@gmail.com> wrote:> Hi: > > Part of the problem is that your string contains spaces; in your example > below, you would need to quote the string, I believe, as in > > "ASPIRIN DISP AAH" %in% y$name [untested...] > > HTH, > Dennis > > > On Fri, Mar 12, 2010 at 10:59 AM, Natalie Van Zuydam <nvanzuydam@gmail.com > > wrote: > >> 842752|ASPIRIN DISP AAH|TABS|75MG|2.9 is taken directly from my y_data >> text >> file. >> >> If I search for ASPIRIN DISP AAH %in% y$name I get: FALSE. Despite the >> fact that it is there in the text file that I loaded into y. There must be >> a >> problem with my input. >> >> tsf_data <- read.table("feb09_267_presc_items_tsf.txt", header = TRUE, sep >> >> "|", quote = "\"'", >> dec = ".",as.is = TRUE,na.strings = "NA",colClasses = NA, >> nrows = 3864284, >> skip = 0, check.names = TRUE,fill=TRUE, >> strip.white = TRUE, blank.lines.skip = TRUE, >> comment.char = "#", allowEscapes = FALSE, flush = FALSE, >> fileEncoding = "", encoding = "unknown") >> >> Would fill=TRUE and strip.white=TRUE affect how the item_code columns are >> matched? When I look at the file in a script editor I cannot see any >> strange symbols or formatting? I have had to use fill otherwise I get the >> following error message: >> >> Warning message: >> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : >> number of items read is not a multiple of the number of columns >> >> Not sure if correcting this in read.table is affecting the join function >> of >> sqldb? >> >> Thanks, >> Natalie >> >> >> >> I also tried 842752 >> >> On Fri, Mar 12, 2010 at 6:05 PM, Gabor Grothendieck < >> ggrothendieck@gmail.com >> > wrote: >> >> > That is not so. 842752 does not exist in y$item_code and ASPIRIN has >> > a code of 22730. >> > >> > > 842752 %in% y$item_code >> > [1] FALSE >> > >> > > subset(y, name == "ASPIRIN") >> > item_code name formulation_code strength bnf_code >> > 850 22730 ASPIRIN TABS 300MG 4.7.1 >> > 855 22780 ASPIRIN PDR NULL 4.7.1 >> > 856 22790 ASPIRIN MIXT $ 4.7.1 >> > >> > >> > On Fri, Mar 12, 2010 at 12:51 PM, Natalie Van Zuydam >> > <nvanzuydam@gmail.com> wrote: >> > > ---------- Forwarded message ---------- >> > > From: Natalie Van Zuydam <nvanzuydam@gmail.com> >> > > Date: Fri, Mar 12, 2010 at 5:49 PM >> > > Subject: Re: [R] sqldf not joining all the fields >> > > To: David Winsemius <dwinsemius@comcast.net> >> > > >> > > >> > > Dear David >> > > >> > > I'm not sure what the problem is as for every item code there is a >> > > corresponding information in the y_data. For example 842752 from the >> > x_data >> > > corresponds to Aspirin in the y_data? Yet when I use sqldf to join >> the >> > two >> > > df's I get NA values in the columns from the y_data in z for 842752 >> item >> > > code....is there something wrong with my sqldf code or something wrong >> > with >> > > the way I have inputed the data frames? >> > > >> > > Thanks for taking the time to help me, >> > > Natalie >> > > >> > > >> > > >> > > On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius < >> dwinsemius@comcast.net >> > >wrote: >> > > >> > >> If I assign the file input to y_data and change you sqldf to >> > >> >> > >> > z <- sqldf("select * from x_data left join y_data using >> > (item_code)"); z >> > >> >> > >> I can replicate your result. Even after changing the types of the two >> > >> item_code fields to match I still get the same result and when I see >> to >> > what >> > >> degree they share values I get: >> > >> >> > >> > sum(x_data$item_code %in% y_data$item_code) >> > >> [1] 2 >> > >> > sum(y_data$item_code %in% x_data$item_code) >> > >> [1] 2 >> > >> >> > >> >> > >> So why are you so sure they are "complete" as you claimed in your >> first >> > >> email. >> > >> >> > >> -- >> > >> David. >> > >> >> > >> >> > >> >> > >> On Mar 12, 2010, at 12:29 PM, David Winsemius wrote: >> > >> >> > >> You have now given two different assignments to x_data and none to >> > y_data: >> > >>> >> > >>> The str( from the file access offering: >> > >>> >> > >>> > str(x_data) >> > >>> 'data.frame': 2848 obs. of 5 variables: >> > >>> $ item_code : int 100 110 150 160 161 164 200 210 212 220 ... >> > >>> $ name : chr "NEONACLEX K" "NEONACLEX" "MESORB" >> "ABSORBENT >> > >>> CELLULOSE MESO" ... >> > >>> $ formulation_code: chr "TABS" "TABS" "DRESS" "DRESS" ... >> > >>> $ strength : chr "NULL" "5MG" "10CMX10CM" "10CMX10CM" ... >> > >>> $ bnf_code : chr "2.2.8" "2.2.1" "20.3.1" "20.3.1" ... >> > >>> >> > >>> The str from assignment from the dput offering >> > >>> > str(x_data) >> > >>> 'data.frame': 10 obs. of 10 variables: >> > >>> $ prochi : chr "CAO0000713" "CAO0000713" "CAO0000713" >> > >>> "CAO0000713" ... >> > >>> $ prescribed_date: chr "22/06/2001" "28/04/2000" "10/04/2000" >> > >>> "03/07/2000" ... >> > >>> $ dataMonth : chr "NULL" "NULL" "NULL" "NULL" ... >> > >>> $ item_code : chr "842752" "7800" "842652" "842652" ... >> > >>> $ res_seqno : chr "NULL" "NULL" "NULL" "NULL" ... >> > >>> $ quantity : chr "60" "100G" "60" "60" ... >> > >>> $ directions : chr "1/D" "A/TD" "1/D" "1/D" ... >> > >>> $ no_of_packs : chr "NULL" "NULL" "NULL" "NULL" ... >> > >>> $ datasource : chr "TSF" "TSF" "TSF" "TSF" ... >> > >>> $ scan_ref_no : chr "NULL" "NULL" "NULL" "NULL" ... >> > >>> >> > >>> This code "worked", but it is not clear that the x-y assignments >> were >> > >>> correct: >> > >>> >> > >>> x_data <- read.table(file=" >> > >>> http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt", >> > header >> > >>> = TRUE, sep = "|", quote = "\"'", >> > >>> dec = ".",as.is = TRUE,na.strings = "NA",colClasses >> NA, >> > >>> nrows = 3864284, >> > >>> skip = 0, check.names = TRUE,fill=TRUE, >> > >>> strip.white = TRUE, blank.lines.skip = TRUE, >> > >>> comment.char = "#", allowEscapes = FALSE, flush >> FALSE, >> > >>> fileEncoding = "", encoding = "unknown") >> > >>> >> > >>> -- >> > >>> David. >> > >>> >> > >>> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote: >> > >>> >> > >>> >> > >>>> The y_data file has over 9000 rows in it so I thought it would be >> more >> > >>>> practical to give you the file to download.... >> > >>>> -- >> > >>>> View this message in context: >> > >>>> >> > >> http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html >> > >>>> Sent from the R help mailing list archive at Nabble.com. >> > >>>> >> > >>>> ______________________________________________ >> > >>>> R-help@r-project.org mailing list >> > >>>> https://stat.ethz.ch/mailman/listinfo/r-help >> > >>>> PLEASE do read the posting guide >> > >>>> http://www.R-project.org/posting-guide.html >> > >>>> and provide commented, minimal, self-contained, reproducible code. >> > >>>> >> > >>> >> > >>> David Winsemius, MD >> > >>> West Hartford, CT >> > >>> >> > >>> ______________________________________________ >> > >>> R-help@r-project.org mailing list >> > >>> https://stat.ethz.ch/mailman/listinfo/r-help >> > >>> PLEASE do read the posting guide >> > >>> http://www.R-project.org/posting-guide.html >> > >>> and provide commented, minimal, self-contained, reproducible code. >> > >>> >> > >> >> > >> David Winsemius, MD >> > >> West Hartford, CT >> > >> >> > >> >> > > >> > > [[alternative HTML version deleted]] >> > > >> > > ______________________________________________ >> > > R-help@r-project.org mailing list >> > > https://stat.ethz.ch/mailman/listinfo/r-help >> > > PLEASE do read the posting guide >> > http://www.R-project.org/posting-guide.html >> > > and provide commented, minimal, self-contained, reproducible code. >> > > >> > >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > >[[alternative HTML version deleted]]