thr3ads.net - R help - [R] Truncation of character fields in a data frame [Sep 2015]

Luigi Marongiu

2015-Sep-27 11:56 UTC

[R] Truncation of character fields in a data frame

Dear all,
I am reading a txt file into the R environment to create a data frame,
however I have noticed that some entries have a truncated version of a
field, so for instance I get "Astro" instead of "Astro 1-Astro 1" and
"Sapo" for "Sapo #1-Sapo_1" and "Sapo #2-Sapo_2", but I also get
"Adeno 40/41 EH-Adeno_40-41_EH" so the problem is not in the spaces
between the words. The txt file is a simple tab-delimited file
generated from Excel, which I read with:

bad.data<-read.table(
    "test_df.txt",
    header=TRUE,
    row.names=1,
    dec = ".",
    sep="\t",
    stringsAsFactors = FALSE,
    fill = TRUE
)

[the fill = TRUE was introduced because in the real case I got an
error of a missing line.]

I can recreate this file as follows:
sample <- c(rep("p.001", 48), rep("p.547", 48))
target <- c(
  "Adeno 1-Adeno 1", "Adeno 40/41 EH-AIQJCT3", "Astro 1-Astro 1",
  "Sapo 1-Sapo 1", "Sapo 2-Sapo 2", "Enterovirus 1-Enterovirus 1",
  "Parechovirus-Parechovirus", "HEV 1-HEV 1", "IC PDV control-AIRSA0B",
  "Rotavirus cam-Rotavirus cam", "18S-Hs99999901_s1",
  "Noro gp II-Noro gp II", "Noro gp 1-Noro gp 1",
  "Noro gp 1 mod33-Noro gp 1 mod33", "C difficile GDH-AIS086J",
  "C difficile Tox B-C difficile Tox B", "VTX 1-AIT97CR",
  "BT control Man-AIVI5IZ", "E. coli vtx 2-E. coli vtx 2",
  "Campy spp-AIWR3O7", "Salmonella ttr-AIX01VF", "Crypto CP2-AIY9Z1N",
  "Green Fluorescent Protein-AI0IX7V", "Adeno 2-Adeno 2",
  "Adeno 40_41 Oly-AI1RWD3", "Astro 2 Liu-AI20UKB",
  "Giardia lambia 1-AI39SQJ", "Rotavirus Liu-Rotavirus Liu 2",
  "Enterovirus Bruges-Enterovirus 2 Br", "HAV 1-Hepatitis A 1",
  "HEV 2-AI5IQWR", "MS2 control-AI6RO2Z", "Rotarix NSP2-AI70M87",
  "CMV br-CMV br", "IC Rnase P-AI89LFF",
  "Salmonella hil A-Salmonella hil A", "Shigella ipa H-AIAA0K8",
  "Enteroagg E. coli-AIBJYRG", "Campy jejuni-AICSWXO",
  "Campy coli-AID1U3W", "Yersinia enterocolitica-AIFAS94",
  "Bacterial 16S-Bacterial 16S",
  "Aeromonas hydrophilia-Aeromonas hydrophilia", "V cholerae-AIGJRGC",
  "Dientamoeba fragilis-AIHSPMK", "Entamoeba histolytica-AII1NSS",
  "Crypto 2 J-AIKALY0", "Giardia lambia rev-AILJJ48",
  "Adeno #1-Adeno_1", "Adeno 40/41 EH-Adeno_40-41_EH",
  "Astro #1-Astro_1", "Sapo #1-Sapo_1", "Sapo #2-Sapo_2",
  "Enterovirus #1-Enterovirus_1", "Parechovirus-Parechovirus",
  "HEV #1-HEV_1", "C coli jejuni Liu-C_coli_jejuni_Li",
  "Rotavirus cam-Rotavirus_cam", "IC 18s-IC 18s",
  "Noro gp II-Noro_gp_II", "Noro gp 1-Noro_gp_1",
  "Noro gp 1 mod33-Noro_gp_1_mod33", "C difficile GDH-C-difficile_GDH",
  "C difficile Tox B-C_difficile_T_B", "E. coli vtx 1-E_coli_vtx_1",
  "BT control Man-BT_control_Man", "E. coli vtx 2-E_coli_vtx_2",
  "Campy spp NEW-Campy_spp_NEW", "Salmonella ttr-Salmonella_ttr",
  "Cryptosporidium spp CP2-Cryptos_spp_CP2", "C jejuni #2-C_jejuni_2",
  "Adeno #2-Adeno_2", "Adeno 40/41 Oly-Adeno_40-41_Oly",
  "Astro Liu #2-Astro_Liu_2", "Giardia lambia #1-Giardia_lambia_1",
  "Rotavirus Liu #2-Rotavirus_Liu_2",
  "Enterovirus #2 Br-Enterovirus_2_Br", "Hepatitis A #1-Hepatitis_A_1",
  "HEV #2-HEV_2", "MS2 control-MS2_control",
  "Rotarix NSP2 Bris-Rotarix_NSP2_Bri", "CMV br-CMV_br",
  "Rnase P control-Rnase_P_control", "Salmonella hil A-Salmonella_hil_A",
  "Shigella ipa H-Shigella_ipa_H", "Enteroagg E. coli-Enteroagg_E_coli",
  "V parahaemolyticus-V_p_haemolyticus", "Campy coli-Campy_coli",
  "Yersinia enterocolitica-Y_enterocolitica",
  "Bacterial 16S-Bacterial_16S",
  "Aeromonas hydrophilia-Aero_hydrophilia",
  "Vibrio cholerae-Vibrio_cholerae",
  "Dientamoeba fragilis-Dien_fragilis",
  "Entamoeba histolytica-Enta_histolytica",
  "Cryptosporidium spp #2 J-Crypto_spp_2_J",
  "Giardia lambia #2 rev-Giardia_lambia_r"
)
ct <- c(NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,
NA,    18.793,    NA,    NA,    NA,    NA,    NA,    NA,    33.302,
NA,    32.388,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,
   NA,    NA,    NA,    31.398,    NA,    NA,    NA,    NA,    NA,
NA,    NA,    NA,    NA,    8.115,    NA,    NA,    NA,    NA,    NA,
  NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,    NA,
 NA,    21.161,    NA,    NA,    NA,    NA,    NA,    NA,    31.302,
 NA,    29.785,    NA,    NA,    NA,    NA,    NA,    NA,    NA,
NA,    NA,    NA,    NA,    31.212,    42.967,    NA,    33.503,
NA,    NA,    NA,    NA,    NA,    NA,    9.584,    NA,    NA,    NA,
  NA,    NA,    NA)

good.data <- data.frame(sample, target, ct, stringsAsFactors = FALSE)

and the structure of these objects is the same:

> str(good.data)
'data.frame':    96 obs. of  3 variables:
 $ sample: chr  "p.001" "p.001" "p.001" "p.001" ...
 $ target: chr  "Adeno 1-Adeno 1" "Adeno 40/41 EH-AIQJCT3" "Astro 1-Astro 1" "Sapo 1-Sapo 1" ...
 $ ct    : num  NA NA NA NA NA NA NA NA NA NA ...
> str(bad.data)
'data.frame':    96 obs. of  3 variables:
 $ Sample: chr  "p.001" "p.001" "p.001" "p.001" ...
 $ Target: chr  "Adeno 1-Adeno 1" "Adeno 40/41 EH-AIQJCT3" "Astro 1-Astro 1" "Sapo 1-Sapo 1" ...
 $ Ct    : num  NA NA NA NA NA NA NA NA NA NA ...

However, in the good.data case the truncation does not occur, so for
instance I get the required "Astro #1-Astro_1", "Sapo #1-Sapo_1" and
"Sapo #2-Sapo_2".
The problem must therefore lie in the format of the txt file and the
read function, possibly in the # character present in the names.
Could somebody explain to me what the problem is and how to avoid it?
Many thanks
Best regards
Luigi

Duncan Murdoch

2015-Sep-27 12:19 UTC

[R] Truncation of character fields in a data frame

On 27/09/2015 7:56 AM, Luigi Marongiu wrote:
> Dear all,
> I am reading a txt file into the R environment to create a data frame,
> however I have noticed that some entries have a truncated version of a
> field, so for instance I get "Astro" instead of "Astro 1-Astro 1" and
> "Sapo" for "Sapo #1-Sapo_1" and "Sapo #2-Sapo_2", but I also get
> "Adeno 40/41 EH-Adeno_40-41_EH" so the problem is not in the spaces
> between the words. The txt file is a simple tab-delimited file
> generated from Excel, which I read with:
> 
> bad.data<-read.table(
>     "test_df.txt",
>     header=TRUE,
>     row.names=1,
>     dec = ".",
>     sep="\t",
>     stringsAsFactors = FALSE,
>     fill = TRUE
> )
> 
> [the fill = TRUE was introduced because in the real case I got an
> error of a missing line.]
See the "comment.char" argument to read.table.  By default the "#"
character marks a comment, as in R code.

Duncan Murdoch
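
The following is a minimal sketch of the behaviour described in the reply above, not code from the thread. It assumes a hypothetical three-row data frame with names modelled on those in the question, written to a temporary tab-delimited file. It reads the file once with the default comment.char = "#" and once with comment.char = "", and uses count.fields to show which lines lose fields.

```r
# Hypothetical three-row sample; the names mimic those in the thread.
df <- data.frame(
  Sample = c("p.001", "p.547", "p.547"),
  Target = c("Astro 1-Astro 1", "Astro #1-Astro_1", "Sapo #2-Sapo_2"),
  Ct     = c(NA, 21.161, 31.302),
  stringsAsFactors = FALSE
)

# Write an unquoted tab-delimited file, similar to a spreadsheet export.
path <- tempfile(fileext = ".txt")
write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Fields per line with the default comment character "#":
# lines containing "#" are cut short, so they report fewer fields.
fields_default <- count.fields(path, sep = "\t")

# Fields per line with comment handling switched off.
fields_nocomment <- count.fields(path, sep = "\t", comment.char = "")

# Default read: text from "#" to the end of the line is discarded,
# and fill = TRUE pads the missing fields with NA instead of failing.
bad <- read.table(path, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE, fill = TRUE)

# Same call with comment.char = "": "#" is treated as ordinary text.
good <- read.table(path, header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE, fill = TRUE,
                   comment.char = "")

print(fields_default)
print(fields_nocomment)
print(bad$Target)
print(good$Target)
```

In this sketch, fill = TRUE hides the symptom: without it, the shortened lines would normally trigger an error about a line not having the expected number of elements, which may be the "missing line" error mentioned in the question. read.delim is another option for tab-delimited files, since its default is comment.char = "".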
