Hi,

I have a series of MS Word files and each file contains plain text. From these texts I would like to extract only those elements (read: words) that are between square brackets. Example of a text:

Most fundamentally, it has led to an effort to clarify the organizational form concept. According to them [see also Smith, Jones and Carroll 2002], categories emerge as audience members recognize dissimilarities among groups of consumers and label them as members of a common set [Nicol 2000].

Now I would like to get the following selection:

see also Smith, Jones and Carroll 2002
Nicol 2000

Any ideas on how to do this? What would be the best way to import the text into R? The entire text as an element in a data frame? Thank you very much!

Best,
Mathijs

--
View this message in context: http://r.789695.n4.nabble.com/Select-elements-from-text-tp4323947p4323947.html
Sent from the R help mailing list archive at Nabble.com.
How about using scan(..., what = "", sep = " ")? That would give you a vector of single words. Then grepl("\\[[[:alnum:]]+\\]", x) will return a logical vector:

> x <- c('test', '[bracket]', 'hi]', '[blah', 'foo', '[bar]')
> grepl('\\[[[:alnum:]]+\\]', x)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE
> x[grepl('\\[[[:alnum:]]+\\]', x)]
[1] "[bracket]" "[bar]"

You might need a more complex regex to catch them all, for example in case of ([citation]) instances.

Justin

On Tue, Jan 24, 2012 at 6:52 AM, mdvaan <mathijsdevaan@gmail.com> wrote:
> From these texts I would like to extract only those elements (read: words)
> that are between square brackets.
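To illustrate the token-based idea above, here is a minimal sketch that pulls out the bracketed part of each token rather than only flagging it. The vector `x`, the names `pattern`, `hits` and `inside`, and the `[[:alnum:]]` character class are illustrative assumptions, not something taken from the original reply.

```r
# Tokens as they might come out of a split on spaces (made-up example data).
x <- c("test", "[bracket]", "hi]", "[blah", "foo", "([bar])", "[baz],")

# Unanchored pattern: "[", one or more alphanumeric characters, "]".
pattern <- "\\[[[:alnum:]]+\\]"

# regexpr() locates the first match in each token; regmatches() returns
# the matched text, and only for the tokens that actually matched.
hits <- regmatches(x, regexpr(pattern, x))

# Drop the surrounding square brackets.
inside <- gsub("^\\[|\\]$", "", hits)
print(inside)
```

Note that this only catches brackets that open and close within a single token. A multi-word citation such as [Nicol 2000] is split into "[Nicol" and "2000]." by the space separator, so neither piece would match.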
Thanks for the quick response. I get the latter part, but reading the text from MS Word into R is problematic. I am able to read in (scan) all unique elements (following sep=" ") from the text, but unable to paste everything together again. Any idea on how to solve this? It looks like this now:

> text <- scan("test.txt", character(0), sep = " ")
> text
 [1] "Most"            "fundamentally,"  "it"              "has"
 [5] "led"             "to"              "an"              "effort"
 [9] "to"              "clarify"         "the"             "organizational"
[13] "form"            "concept."        "According"       "to"
[17] "them"            "[see"            "also"            "Smith,"
[21] "Jones"           "and"             "Carroll"         "2002],"
[25] "categories"      "emerge"          "as"              "audience"
[29] "members"         "recognize"       "dissimilarities" "among"
[33] "groups"          "of"              "consumers"       "and"
[37] "label"           "them"            "as"              "members"
[41] "of"              "a"               "common"          "set"
[45] "[Nicol"          "2000]."
paste(text, collapse = " ")

Michael

On Tue, Jan 24, 2012 at 3:41 PM, mdvaan <mathijsdevaan at gmail.com> wrote:
> I am able to read in (scan) all unique elements (following sep=" ") from
> the text, but unable to paste everything together again.
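Putting the pieces of the thread together, one possible end-to-end sketch is shown below. It assumes the Word document has already been saved as a plain-text file (the thread uses "test.txt"; a temporary file stands in for it here), and the names `tmp`, `words`, `txt`, `bracketed` and `citations` are hypothetical. The pattern differs from the one suggested earlier: it matches any run of characters other than "]" between the brackets, so multi-word citations survive.

```r
# Write the example text to a temporary file so the sketch is self-contained;
# in practice this would be the plain-text export of the Word document.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Most fundamentally, it has led to an effort to clarify the organizational",
  "form concept. According to them [see also Smith, Jones and Carroll 2002],",
  "categories emerge as audience members recognize dissimilarities among groups",
  "of consumers and label them as members of a common set [Nicol 2000]."
), tmp)

# Read the single words, then glue them back into one string.
words <- scan(tmp, what = character(), sep = " ", quote = "", quiet = TRUE)
txt <- paste(words, collapse = " ")

# Match "[", then one or more characters that are not "]", then "]".
m <- gregexpr("\\[[^]]+\\]", txt)
bracketed <- regmatches(txt, m)[[1]]

# Remove the first and last character (the brackets) from each match.
citations <- substr(bracketed, 2, nchar(bracketed) - 1)
print(citations)
```

For several files, the same steps could be wrapped in a function and applied over a vector of file names with lapply(), giving one character vector of citations per file.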