Jennifer Sabatier
2015-Nov-02 18:39 UTC
[R] Locating the starting position of the first number in a string
Hi,

So, I've got a vector of strings that look like this:

ID <- c("IBBS3_MSM_HN01209", "IBBS3_MSM_HN01210", "IBBS3_MSM_HN01211",
"IBBS3_MSM_HN10212", "IBBS3_MSM_HN104213", "IBBS3_MSM_HN10214",
"IBBS3_MSM_HN44215", "IBBS3_MSM_HN44216", "IBBS3_MSM_HN44217",
"IBBS3_MSM_HN44218", "IBBS3_MSM_HN44219", "IBBS3_MSM_HN44220",
"IBBS3_MSM_HN44221", "IBBS3_MSM_HN44222", "IBBS3_MSM_HN44223",
"IBBS3_MSM_HN44224", "IBBS3_MSM_HN44225", "IBBS3_MSM_HN44226",
"IBBS3_MSM_HN44227", "IBBS3_MSM_HN12228", "IBBS3_MSM_HN12229",
"IBBS3_MSM_HN12230", "IBBS3_MSM_HN12231", "IBBS3_MSM_HN12232",
"IBBS3_MSM_HN12233", "IBBS3_MSM_HN12234", "IBBS3_MSM_HN12235",
"IBBS3_MSM_HN12236", "IBBS3_MSM_HN12237", "IBBS3_MSM_HN12238",
"IBBS3_MSM_HN12239", "IBBS3_MSM_HN12240", "IBBS3_MSM_HN12241",
"IBBS3_MSM_HN12242", "IBBS3_MSM_HN12243", "IBBS3_MSM_HN12244",
"IBBS3_MSM_HN12245", "IBBS3_MSM_HN12246", "IBBS3_MSM_HN12247",
"IBBS3_MSM_HN12248", "IBBS3_MSM_HN12249", "IBBS3_MSM_HN12250",
"IBBS3_MSM_HN12251", "IBBS3_MSM_HN12252", "IBBS3_MSM_HN12253",
"IBBS3_MSM_HN12254", "IBBS3_MSM_HN12255", "IBBS3_MSM_HN25256",
"IBBS3_MSM_HN25257", "IBBS3_MSM_HN25258", "IBBS3_MSM_HN25259",
"IBBS3_MSM_HN25260", "IBBS3_MSM_HN25261", "IBBS3_MSM_HN25262",
"IBBS3_MSM_HN25263", "IBBS3_MSM_HN25264", "IBBS3_MSM_HN25265",
"IBBS3_MSM_HN25266", "IBBS3_MSM_HN25267", "IBBS3_MSM_HN25268",
"IBBS3_MSM_HN25269", "IBBS3_MSM_HN25270", "IBBS3_MSM_HN25271",
"IBBS3_MSM_HN25272", "IBBS3_MSM_HN25273", "IBBS3_MSM_HN25274",
"IBBS3_MSM_HN25275", "IBBS3_MSM_HN25276", "IBBS3_MSM_HN25277",
"IBBS3_MSM_HN25278", "IBBS3_MSM_HN25279", "IBBS3_MSM_HN25280",
"IBBS3_MSM_HN25281", "IBBS3_MSM_HN25282", "IBBS3_MSM_HN25283",
"IBBS3_MSM_HN25284", "IBBS3_MSM_HMC44285", "IBBS3_MSM_HMC44286",
"IBBS3_MSM_HMC44287", "IBBS3_MSM_HMC44288", "IBBS3_MSM_HMC44289",
"IBBS3_MSM_HMC44290", "IBBS3_MSM_HMC44291", "IBBS3_MSM_HMC44292",
"IBBS3_MSM_HMC44293", "IBBS3_MSM_HMC44294", "IBBS3_MSM_HMC44295",
"IBBS3_MSM_HMC44296", "IBBS3_MSM_HMC44297", "IBBS3_MSM_HMC44298",
"IBBS3_MSM_HMC44299", "IBBS3_MSM_HMC44300", "IBBS3_MSM_HMC44301",
"IBBS3_MSM_HMC44302", "IBBS3_MSM_HMC44303", "IBBS3_MSM_HMC44304",
"IBBS3_MSM_HMC44305", "IBBS3_MSM_HMC44306", "IBBS3_MSM_HMC44307",
"IBBS3_MSM_HMC44309")

This is an ID that is in the following format: IBBS3_Type_Group#####

What I want to do is locate the starting position of Type, which is anywhere from 3 to 4 letters long (in this example it's either MSM or PWID), the starting position of Group, which is 2-3 letters long (either HN or HMC), and finally the starting position of the 5-digit number.

I'm able to get Type and Group using the following:

TYPE_s <- sapply(c("MSM", "PWID"), regexpr, ID, ignore.case=T)

GROUP_s <- (sapply(c("HN", "HMC"), regexpr, ID, ignore.case=T))

What I am having trouble with is getting the starting position of the 5-digit number.

I am trying:

DIGITS_s <- sapply("([0:9])", regexpr, ID, ignore.case=T)

But that just seems to look for the position of the first 0:

> DIGITS_s
       ([0:9])
  [1,]      13
  [2,]      13
  [3,]      13
  [4,]      14
  [5,]      14
  [6,]      14
  [7,]      -1
  [8,]      -1
  [9,]      -1
 [10,]      -1
 [11,]      17
 [12,]      17
 [13,]      -1
 [14,]      -1
 [15,]      -1
 [16,]      -1
 [17,]      -1
 [18,]      -1
 [19,]      -1
 [20,]      -1
 [21,]      17
 [22,]      17
 [23,]      -1
 [24,]      -1
 [25,]      -1
 [26,]      -1
 [27,]      -1
 [28,]      -1
 [29,]      -1
 [30,]      -1
 [31,]      17
 [32,]      17
 [33,]      -1
 [34,]      -1
 [35,]      -1
 [36,]      -1
 [37,]      -1
 [38,]      -1
 [39,]      -1
 [40,]      -1
 [41,]      17
 [42,]      17
 [43,]      -1
 [44,]      -1
 [45,]      -1
 [46,]      -1
 [47,]      -1
 [48,]      -1
 [49,]      -1
 [50,]      -1
 [51,]      17
 [52,]      17
 [53,]      -1
 [54,]      -1
 [55,]      -1
 [56,]      -1
 [57,]      -1
 [58,]      -1
 [59,]      -1
 [60,]      -1
 [61,]      17
 [62,]      17
 [63,]      -1
 [64,]      -1
 [65,]      -1
 [66,]      -1
 [67,]      -1
 [68,]      -1
 [69,]      -1
 [70,]      -1
 [71,]      17
 [72,]      17
 [73,]      -1
 [74,]      -1
 [75,]      -1
 [76,]      -1
 [77,]      -1
 [78,]      -1
 [79,]      -1
 [80,]      -1
 [81,]      18
 [82,]      17
 [83,]      17
 [84,]      17
 [85,]      17
 [86,]      17
 [87,]      17
 [88,]      17
 [89,]      17
 [90,]      17
 [91,]      17
 [92,]      17
 [93,]      17
 [94,]      17
 [95,]      17
 [96,]      17
 [97,]      17
 [98,]      17
 [99,]      17
[100,]      17

So, clearly, this is wrong. I just would like to find the starting position of the first digit, no matter what it is.

It's probably easy, isn't it?

Best,

Jen
Peter Alspach
2015-Nov-02 20:32 UTC
[R] Locating the starting position of the first number in a string
Tena koe Jen

Not answering your question: if you are after these locations in order to split the IDs into columns, then you might like to consider strsplit; e.g.,

t(sapply(strsplit(ID, '_'), rbind))

You could then split the last column. You state that there is a 5-digit number at the end. If this is correct, then use this feature (i.e., nchar(ID)-4), as you'd want "IBBS3_MSM_HN104213" (the fifth element in ID) to split to IBBS3, MSM, HN1 and 04213. However, if it isn't always 5 digits, then split at the first number (i.e., HN and 104213).

HTH .....

Peter Alspach
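The two ways of splitting the last column described above can be sketched in base R. This is only an illustration on a three-element sample taken from the posted vector; the names `parts`, `last`, `group_fixed`, `number_fixed`, `group_var` and `number_var` are made up for the example and do not come from the thread.

```r
# Small sample from the posted vector, including the six-digit case.
ID <- c("IBBS3_MSM_HN01209", "IBBS3_MSM_HN104213", "IBBS3_MSM_HMC44285")

# Split on the underscore; each ID yields three pieces.
parts <- do.call(rbind, strsplit(ID, "_", fixed = TRUE))
last  <- parts[, 3]          # e.g. "HN01209"
n     <- nchar(last)

# Option 1: trust the "5-digit number at the end" rule.
group_fixed  <- substr(last, 1, n - 5)   # "HN"  "HN1" "HMC"
number_fixed <- substr(last, n - 4, n)   # "01209" "04213" "44285"

# Option 2: split at the first digit of the last piece instead.
first_digit <- regexpr("[0-9]", last)
group_var   <- substr(last, 1, first_digit - 1)   # "HN" "HN" "HMC"
number_var  <- substr(last, first_digit, n)       # "01209" "104213" "44285"
```

The two options differ only for "IBBS3_MSM_HN104213", so which one is right depends on whether that ID is a typo or a genuine six-digit number.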
Jeff Newmiller
2015-Nov-02 21:33 UTC
[R] Locating the starting position of the first number in a string
Also not answering your question directly, but this may provide some useful ideas or results:

> library( gsubfn )
>
> DF <- setNames( data.frame( t( strapply( ID
+                                        , "^[^_]+_([A-Z]+)_([A-Z]+)([0-9]+)$"
+                                        , c
+                                        , simplify=TRUE
+                                        )
+                              )
+                           , stringsAsFactors = FALSE
+                           )
+               , c( "Type", "Group", "Number" )
+               )
> str( DF )
'data.frame': 100 obs. of 3 variables:
 $ Type  : chr "MSM" "MSM" "MSM" "MSM" ...
 $ Group : chr "HN" "HN" "HN" "HN" ...
 $ Number: chr "01209" "01210" "01211" "10212" ...
Boris Steipe
2015-Nov-03 00:18 UTC
[R] Locating the starting position of the first number in a string
The regular expression you are looking for is \d{5} ... a "digit" repeated five times. Note that you have to escape the escape in an R string (i.e. "\\d{5}").

But your example does not conform to the description: you have examples with six-digit numbers: IBBS3_MSM_HN104213. If there is length variation, I would just search for \d+ (at least one) or \d{5,} (at least five).

And even though you sent a vector with some hundred elements, it doesn't actually contain the choices you are asking for.

Finally, I'm not sure why you want the "starting" positions, rather than the keys you find.

Your sample code is not at all how one does this. Define the three elements that you want to capture, put them in parentheses and evaluate the matches that regexec() returns. Also give us a smaller example, but one that contains all of the relevant cases.

ID <- c("IBBS3_MSM_HN01209",
        "IBBS3_PWID_HN01210",
        "IBBS3_MSM_HMC01211",
        "IBBS3_PWID_HMC10212")

# now consider the regular expression:
regexec(".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)", ID[1])

# This is:
#   any character one or more times,
#   followed by either MSM OR PWID,
#   followed by an underscore,
#   followed by either HN OR HMC,
#   followed by one or more digits

# Look at the result: it's a list. Each list element is a vector of
# starting positions; its "match.length" attribute gives the match lengths.
# Compare:
regexec(".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)", ID[3])

# Following the logic of the nested parentheses,
# you are looking for matches in position 2, 5 and
# 8 of your expression.

result <- matrix(numeric(3 * length(ID)), ncol=3)
colnames(result) <- c("TYPE", "GROUP", "ID")

for (i in seq_along(ID)) {
    m <- regexec(".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)", ID[i])
    result[i,] <- m[[1]][c(2, 5, 8)]   # write the three starting
                                       # positions into a row
                                       # of your matrix
}

# of course it's trivial now to actually capture
# the keys but that's not what you asked for...

B.
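For the narrower question as asked (where does the number start?), here is a minimal base-R sketch on the four-element sample above, assuming the number always ends the ID. Two details are worth spelling out. First, "[0:9]" is a bracket expression listing the three characters 0, : and 9, not a range, which is why the original attempt only ever located a 0 (or a 9). Second, an unanchored "[0-9]" or "\\d+" matches the 3 in IBBS3 before anything else, so the digit run is anchored to the end of the string here. The names `no_hit`, `first_any`, `starts` and `keys` are illustrative only; `DIGITS_s` is reused from the question.

```r
ID <- c("IBBS3_MSM_HN01209",
        "IBBS3_PWID_HN01210",
        "IBBS3_MSM_HMC01211",
        "IBBS3_PWID_HMC10212")

# "[0:9]" matches only "0", ":" or "9"; this ID contains none of them.
no_hit <- as.integer(regexpr("[0:9]", "IBBS3_MSM_HN44215"))   # -1

# "[0-9]" is the digit range, but unanchored it finds the "3" in "IBBS3".
first_any <- as.integer(regexpr("[0-9]", ID))                 # 5 5 5 5

# Anchoring the run of digits at the end of the string skips the prefix.
DIGITS_s <- as.integer(regexpr("[0-9]+$", ID))                # 13 14 14 15

# regexec() is vectorised over the text, so the loop can be replaced by
# one call; elements 2, 5 and 8 are the Type, Group and number groups.
pattern <- ".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)"
m <- regexec(pattern, ID)

starts <- t(sapply(m, function(x) x[c(2, 5, 8)]))
colnames(starts) <- c("TYPE", "GROUP", "ID")

# regmatches() turns the same match data into the captured keys.
keys <- t(sapply(regmatches(ID, m), function(x) x[c(2, 5, 8)]))
colnames(keys) <- c("TYPE", "GROUP", "ID")
```

If some IDs might not match the pattern at all, the `x[c(2, 5, 8)]` step would need a guard, since a failed match returns a single -1 rather than eight positions.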