Good afternoon all -

While making steady progress in learning R after Matlab, I have run into a problem that I need some help to get past. Basically, I want to merge data from a biological statistical dataset with annotation data extracted from another dataset, using an 'id' cross-reference, and write the result to a report file. The first part goes absolutely fine: I have merged both datasets into a data.frame. But when I try to write it to a csv file using 'write.table', it does write the 'data.frame' object, but it also inserts some parts of the annotation data which are not supposed to be there...

A little snapshot of the file output is below to illustrate. The upper half is fine; that's how it should be. The lower half, which actually appears to be space-separated rather than comma-separated, was obviously grabbed from the annotation dataset and is not supposed to be here.

--------------------------------8<--------------------------------------------
"344","166128",126.44286392082,179.904700814932,72.9810270267088,0.40566492535281,-1.3016395254146,2.47449355237252e-07,4.2901159299567e-06,"Chitinas
"18816","238247",92.5282508325735,135.981255262454,49.0752464026927,0.36089714209487,-1.47034037615176,2.5330054329543e-07,4.38862252337004e-06,"Prot
"22072","222365",30.8191942806426,52.4262903365628,9.21209822472236,0.17571524068522,-2.50868876576414,2.54433836512085e-07,4.40531098485028e-06,NA,N
"25062","226605",30.808007579908,50.3976662241578,11.2183489356581,0.22259659575825,-2.16749656564076,2.54934711860645e-07,4.41103467375713e-06,NA,NA
"7539","247009",75.4175439970731,34.4643221134552,116.370765880691,3.37655751642533,1.75555313265164,2.60010673210741e-07,4.49585878338091e-06,NA,NA,
"407","267139",425.559675915702,279.393013150954,571.72633868045,2.04631580522577,1.03302881149302,2.61074218843609e-07,4.51123710239304e-06,NA,NA,NA
"26530","171300",146.80096060985,80.0063286553601,213.595592564339,2.66973370924738,1.4166958484644,2.68061220749976e-07,4.62888115991058e-06,NA,NA,N
"3078","159013",34.3260176515511,52.4580790080106,16.1939562950917,0.308702808057816,-1.69570948866688,2.69104298652827e-07,4.64379716436078e-06,"40S
"4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-

171597 171597 KOG1347 Uncharacterized membrane protein, predicted efflux pump General function prediction only POORLY CHARACTERIZED
171658 171658 KOG4290 Predicted membrane protein Function unknown POORLY CHARACTERIZED
171660 171660 KOG0903 Phosphatidylinositol 4-kinase, involved in intracellular trafficking and secretion Signal transduction mechanisms CELLULAR
171660 171660 KOG0903 Phosphatidylinositol 4-kinase, involved in intracellular trafficking and secretion Intracellular trafficking, secretion, and
171703 171703 KOG2674 Cysteine protease required for autophagy - Apg4p/Aut2p Cytoskeleton CELLULAR PROCESSES AND SIGNALING
171703 171703 KOG2674 Cysteine protease required for autophagy - Apg4p/Aut2p Intracellular trafficking, secretion, and vesicular transport CELLU
and metabolism METABOLISM
--------------------------------8<--------------------------------------------

And this is the piece of code that produced it:

--------------------------------8<--------------------------------------------
n = nrow(statdata)
extra = data.frame(kogdefline = rep(NA, n),
                   kogClass   = rep(NA, n),
                   kogGroup   = rep(NA, n))

# ids present in both datasets, and their row positions in each
subset = intersect(statdata$id, annot$id)
MR = match(subset, annot$id)
ML = match(subset, statdata$id)

# copy the three annotation columns into the matching rows
extra[ML,1] = as.character(annot[MR,2])
extra[ML,2] = as.character(annot[MR,3])
extra[ML,3] = as.character(annot[MR,4])
# strangely, if I do
#   extra[ML,] = as.character(annot[MR,2:4])
# it produces digits (???) instead of a string value

mergedData = data.frame(statdata, extra)
write.table(mergedData, 'filename.csv', sep=',')
--------------------------------8<--------------------------------------------

Any ideas why this is happening?

Many thanks
-Igor
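For anyone wanting to reproduce the steps described above, here is a minimal self-contained sketch. The contents of 'statdata' and 'annot' are made up (the real objects were not posted), and it swaps the intersect() plus two match() calls for a single match() on the ids, which is an alternative rather than the poster's exact method. The last step checks that the written file has one line per row plus the header, which is one way to detect the stray lines shown in the snapshot.

```r
# Toy stand-ins for the poster's objects; every value here is made up.
statdata <- data.frame(id   = c("166128", "238247", "222365"),
                       pval = c(2.47e-07, 2.53e-07, 2.54e-07),
                       stringsAsFactors = FALSE)
annot <- data.frame(id         = c("238247", "166128", "999999"),
                    kogdefline = c("Protein kinase", "Chitinase", "Unused"),
                    kogClass   = c("Signal transduction", "Carbohydrate metabolism", "None"),
                    kogGroup   = c("CELLULAR", "METABOLISM", "NONE"),
                    stringsAsFactors = FALSE)

# Position of each statdata id in annot (NA where there is no annotation).
idx <- match(statdata$id, annot$id)

# Row-index annot by idx; ids without a match yield rows of NA.
extra <- annot[idx, c("kogdefline", "kogClass", "kogGroup")]
mergedData <- data.frame(statdata, extra, row.names = NULL,
                         stringsAsFactors = FALSE)

# Write the file, then count its lines: header + one line per row is expected.
out <- tempfile(fileext = ".csv")
write.table(mergedData, out, sep = ",", row.names = FALSE)
n_lines <- length(readLines(out))
```

If 'n_lines' is larger than nrow(mergedData) + 1, some field probably contains embedded line breaks, which would point back at how the annotation file was read in.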
On Sep 19, 2012, at 9:12 AM, Igor wrote:

> There is a little snapshot of the file output below to illustrate. The
> upper half is fine, that's how it should be. The lower half, which
> actually appears to be space-separated, not comma-separated, was obviously
> grabbed from the annotation dataset and is not supposed to be here.

This looks like the sort of thing that occurs when there is a mismatched or missing double or single quote, or perhaps a comment character (a "#" that terminated a line read), somewhere. The logical place to look is in the line of data just above the pathological stretch of data. You have clearly offered only a truncated version of the data, since there are many instances of lines ending without matching quotes, even one in the first line.

-- 
David.

David Winsemius, MD
Alameda, CA, USA
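The quote and comment-character hypothesis above can be checked before any merging is done. The sketch below is only an illustration: the file contents are invented, and it assumes the annotation file is tab-separated, which the thread does not confirm. It uses count.fields() to look for lines with an unexpected field count, then reads the file with quoting and comment handling switched off.

```r
# Hypothetical tab-separated annotation file; the apostrophes and the "#"
# are made up to illustrate the problem, not taken from the poster's data.
txt <- c("id\tkog\tdefline",
         "1\tKOG0001\t5'-nucleotidase",
         "2\tKOG0002\tPredicted membrane protein",
         "3\tKOG0003\t3'-phosphatase",
         "4\tKOG0004\tUbiquitin # not a comment")
tmp <- tempfile(fileext = ".txt")
writeLines(txt, tmp)

# With quoting and comments enabled, lines swallowed by an open quote tend
# to show up as NA or with an unexpected number of fields.
with_quotes <- count.fields(tmp, sep = "\t", quote = "\"'", comment.char = "#")
print(with_quotes)

# With both disabled, every line should report the same number of fields.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
print(n_fields)

# Read the table the same way, so apostrophes and "#" stay ordinary text.
annot <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                    comment.char = "", stringsAsFactors = FALSE)
```

If the two count.fields() results differ, a stray quote or "#" in the input is a likely culprit, and the line just above the first discrepancy is the place to look.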
It would also be helpful if you could provide the output of 'str' for all the objects that you are using, e.g.,

str(statdata)
str(extra)

Also, when creating your data.frame, use "stringsAsFactors = FALSE":

extra = data.frame(kogdefline = rep(NA, n)
    , kogClass = rep(NA, n)
    , kogGroup = rep(NA, n)
    , stringsAsFactors = FALSE
    )

On Wed, Sep 19, 2012 at 12:12 PM, Igor <igorc at essex.ac.uk> wrote:
> And this is a piece of code that produced this:
>
>> n = nrow(statdata)
>> extra = data.frame(kogdefline=rep(NA,n), kogClass = rep(NA,n), kogGroup = rep(NA,n))
>
> Any ideas why this is happening?
>
> Many thanks
> -Igor
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
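To illustrate the 'str' and "stringsAsFactors" advice, here is a small sketch with made-up data. It assumes the annotation columns were factors, which would be consistent with the "digits" remark in the original post but is not confirmed anywhere in the thread. as.character() applied to a whole data frame does not convert it column by column, so the sketch converts each factor column separately.

```r
# Toy annotation table (made-up values). stringsAsFactors = TRUE mimics the
# default behaviour of data.frame() at the time of this thread.
annot <- data.frame(id         = c("a", "b", "c"),
                    kogdefline = c("Chitinase", "Protein kinase", "Ubiquitin"),
                    kogClass   = c("Metabolism", "Signalling", "Turnover"),
                    stringsAsFactors = TRUE)
str(annot)  # shows which columns are factors

# Convert every factor column to character, one column at a time.
is_fac <- vapply(annot, is.factor, logical(1))
annot[is_fac] <- lapply(annot[is_fac], as.character)
str(annot)  # all columns are now character

# Placeholder frame with character NA columns, as suggested above.
n <- 3
extra <- data.frame(kogdefline = rep(NA_character_, n),
                    kogClass   = rep(NA_character_, n),
                    stringsAsFactors = FALSE)

# Several columns can now be assigned in one step.
extra[c(1, 3), ] <- annot[c(2, 1), c("kogdefline", "kogClass")]
```

Once the columns are character rather than factor, the three separate as.character() assignments in the original code are no longer needed.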