bob stoner
2010-Dec-30 18:03 UTC
[R] Analysing Character Strings for subsequent frequency analysis
Hi

I'm trying to get to grips with R and establish R as a teaching medium in my secondary school. I would like to use R to analyse text so I can produce frequency analysis of the text for subsequent examination of ciphers. I can produce code in VBA but I am struggling when writing in R to examine each character. There must be a clear method using the vectorised format of R. Furthermore, how do you substr a text string and reference each letter? I can use nchar to see how many letters per string but not to select each letter. I would prefer to remain in R and not deviate to Python etc as getting R onto the school mainframe has been a long journey...

Many thanks

Bob Stoner
Sleaford, Lincolnshire, UK
Marc Schwartz
2010-Dec-30 20:59 UTC
[R] Analysing Character Strings for subsequent frequency analysis
On Dec 30, 2010, at 12:03 PM, bob stoner wrote:

> Hi
> I'm trying to get to grips with R and establish R as a teaching medium in my secondary school. I would like to use R to analyse text so I can produce frequency analysis of the text for subsequent examination of ciphers. I can produce code in VBA but I am struggling when writing in R to examine each character. There must be a clear method using the vectorised format of R. Furthermore, how do you substr a text string and reference each letter? I can use nchar to see how many letters per string but not to select each letter. I would prefer to remain in R and not deviate to Python etc as getting R onto the school mainframe has been a long journey...
> Many thanks
>
> Bob Stoner
> Sleaford, Lincolnshire, UK

There are likely to be some text analysis packages on CRAN, but taking a basic approach to generating a frequency table of characters in a vector:

Vec <- "The lazy brown fox"

# See ?strsplit, which returns a list
> strsplit(Vec, "")
[[1]]
 [1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"

# Get the first list element
> strsplit(Vec, "")[[1]]
 [1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"

# Where are the o's in the vector?
> which(strsplit(Vec, "")[[1]] == "o")
[1] 12 17

# generate the frequency table of letters
> table(strsplit(Vec, "")[[1]])

  a b e f h l n o r T w x y z
3 1 1 1 1 1 1 1 2 1 1 1 1 1 1

Now, let's say that Vec has multiple elements, perhaps the result of using readLines() on a text file:

Vec <- c("The lazy brown fox", "jumped over the fence")

> strsplit(Vec, "")
[[1]]
 [1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"

[[2]]
 [1] "j" "u" "m" "p" "e" "d" " " "o" "v" "e" "r" " " "t" "h" "e" " " "f"
[18] "e" "n" "c" "e"

# Use lapply() to loop over each list element returned by strsplit()
# generating a frequency table for each
> lapply(strsplit(Vec, ""), table)
[[1]]

  a b e f h l n o r T w x y z
3 1 1 1 1 1 1 1 2 1 1 1 1 1 1

[[2]]

  c d e f h j m n o p r t u v
3 1 1 5 1 1 1 1 1 1 1 1 1 1 1

# Get the first 4 letters in each
# See ?substr
> substr(Vec, 1, 4)
[1] "The " "jump"

HTH,

Marc Schwartz
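
For cipher work it may be more useful to have one frequency table pooled over all lines, ignoring case, spaces and punctuation. The following is only a minimal sketch building on the strsplit() and table() calls above; the variable names (first, chars, all_chars, letters_only, freq, rel_freq) are illustrative and not from the original posts. It also shows one way to pick out a single character by position with substring(), which is vectorised over its start and stop arguments.

```r
# Example input: two lines of text, as readLines() might return them
Vec <- c("The lazy brown fox", "jumped over the fence")

# Reference individual characters of one string by position.
# substring() accepts vectors of start and stop positions.
first <- Vec[1]
chars <- substring(first, 1:nchar(first), 1:nchar(first))
chars[5]   # "l"

# Pool every line into one character vector, fold to lower case,
# and keep only the letters a to z (drops spaces and punctuation)
all_chars <- tolower(unlist(strsplit(Vec, "")))
letters_only <- all_chars[all_chars %in% letters]

# Count each letter; factor(levels = letters) keeps zero counts,
# so all 26 letters appear in alphabetical order
freq <- table(factor(letters_only, levels = letters))

# Relative frequencies, sorted from most to least common
rel_freq <- sort(freq / sum(freq), decreasing = TRUE)
round(head(rel_freq), 3)
```

Fixing the factor levels to the built-in `letters` constant means tables from different texts line up column by column, which should make it easier to compare a ciphertext against a reference text.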
Gabor Grothendieck
2010-Dec-30 21:17 UTC
[R] Analysing Character Strings for subsequent frequency analysis
On Thu, Dec 30, 2010 at 1:03 PM, bob stoner <bob.stoner at btinternet.com> wrote:

> Hi
> I'm trying to get to grips with R and establish R as a teaching medium in my
> secondary school. I would like to use R to analyse text so I can produce
> frequency analysis of the text for subsequent examination of ciphers. I can
> produce code in VBA but I am struggling when writing in R to examine each
> character. There must be a clear method using the vectorised format of R.
> Furthermore, how do you substr a text string and reference each letter? I
> can use nchar to see how many letters per string but not to select each
> letter. I would prefer to remain in R and not deviate to Python etc as
> getting R onto the school mainframe has been a long journey...
> Many thanks
>
> Bob Stoner
> Sleaford, Lincolnshire, UK

Google for:

  CRAN Task View on Natural Language Processing

for an overview of the addon packages for analyzing text.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com