Sachinthaka Abeywardana
2013-Jan-14 09:30 UTC
[R] Grabbing Specific Words from Content (basic text mining)
Hi all,

Suppose I have a data frame with mixed content (name, age and address).

    a <- "Name: John Smith Age: 35 Address: 32, street, sub, something"
    b <- data.frame(a)

1. I want to extract the name, age and address separately from this data frame (which may contain more people).

2. In case I have to deal with it: how would the syntax change if I had "Name" as opposed to "Name:" (without the colon)?

Any thoughts are much appreciated.

Thanks,
Sachin
Oliver Keyes
2013-Jan-14 10:04 UTC
[R] Grabbing Specific Words from Content (basic text mining)
Total newb here, but you might want to check out ?grep and ?regmatches as a start (ways to match and then extract substrings, respectively).
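As a rough illustration of that suggestion, the sketch below pairs regmatches() with regexec() (rather than grep() itself) so that each parenthesised group is returned separately. The pattern and the variable names `pattern`, `m` and `fields` are assumptions made for this example; it handles only the "Name:" form with colons.

```r
a <- "Name: John Smith Age: 35 Address: 32, street, sub, something"
b <- data.frame(a, stringsAsFactors = FALSE)

# One capture group per field, anchored on the three labels
pattern <- "Name: (.*) Age: (.*) Address: (.*)"

# regexec() locates the full match and each parenthesised group;
# regmatches() returns them as one character vector per input string
m <- regmatches(b$a, regexec(pattern, b$a))

# The first element is the full match, so drop it and keep the three fields
fields <- m[[1]][-1]
fields
```

For a data frame with several rows, `m` holds one such vector per row, so the same indexing can be applied with lapply().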
Manjusha Joshi
2013-Jan-14 10:33 UTC
[R] Grabbing Specific Words from Content (basic text mining)
Hello,

> Suppose I have a data frame with mixed content (name age and address).
>
> a<-"Name: John Smith Age: 35 Address: 32, street, sub, something"
> b<-data.frame(a)

Since it is a data frame, the assumption is that you stored the data in columns such as Name, Age, Address, etc.

> 1. The question is I want to extract the name age and address separately
> from this data frame (containing potentially more people).

b$Name will extract all the data in the column "Name", and similarly for the other fields, provided those columns exist. You can assign the result to another variable.

> 2. Also just incase I have to deal with it how would the syntax change if I
> had "Name" as opposed to "Name:" (without the colon).

Slightly different words can be handled with the agrep command, although your situation may be different. This may be something you can use to proceed further: agrep with the max.distance option finds approximate matches, for example

    agrep("name", f, max.distance=0.1)

where f stands for the character vector being searched. Alternatively, the pattern "Name|Name:" may work in your situation.

--
Manjusha S. Joshi
blog: http://manjushajoshi.wordpress.com/
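To make the two suggestions concrete, here is a small sketch comparing an exact pattern with an optional colon against approximate matching with agrepl() (the logical counterpart of agrep). The sample vector `x`, including the misspelled "Nane:", is invented for illustration and is not from the original question.

```r
# Hypothetical sample input: with colon, without colon,
# a one-letter typo in the label, and an unrelated string
x <- c("Name: John Smith", "Name John Smith", "Nane: John Smith", "Age: 35")

# Exact match with an optional colon; this covers both "Name:" and "Name"
exact <- grepl("^Name:? ", x)

# Approximate match: max.distance = 0.1 is a fraction of the pattern length,
# which for the four-letter pattern "Name" rounds up to a single edit
fuzzy <- agrepl("Name", x, max.distance = 0.1)

# Normalise the label so later patterns only need to handle "Name:"
normalised <- sub("^Name:? ", "Name: ", x)
```

The exact pattern rejects the misspelled label, whereas the approximate match should accept it. Approximate matching can also accept strings that merely resemble the label, so the exact pattern is probably the safer choice when only the colon varies.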
Gabor Grothendieck
2013-Jan-14 10:47 UTC
[R] Grabbing Specific Words from Content (basic text mining)
Try this:

    library(gsubfn)

    a <- "Name: John Smith Age: 35 Address: 32, street, sub, something"
    b <- data.frame(a)
    strapplyc(as.character(b$a), "Name: (.*) Age: (.*) Address: (.*)")[[1]]
    # [1] "John Smith"                 "35"
    # [3] "32, street, sub, something"

    # Without the colons:
    a. <- "Name John Smith Age 35 Address 32, street, sub, something"
    b. <- data.frame(a.)
    strapplyc(as.character(b.$a.), "Name (.*) Age (.*) Address (.*)")[[1]]
    # [1] "John Smith"                 "35"
    # [3] "32, street, sub, something"

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
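For readers without gsubfn installed, a similar result can be sketched in base R with utils::strcapture(), which is only available in R 3.4.0 and later (newer than this thread). This is a swapped-in technique, not the method shown above; the second record, the optional colons in the pattern, and the names `x`, `proto` and `people` are assumptions for illustration.

```r
# Two hypothetical records: one with colons after the labels, one without
x <- c(
  "Name: John Smith Age: 35 Address: 32, street, sub, something",
  "Name Adam Grey Age 25 Address 26, street, sub, something"
)

# ":?" makes each colon optional, so one pattern covers both layouts
pattern <- "Name:? (.*) Age:? ([0-9]+) Address:? (.*)"

# The prototype supplies the column names and types for the captured groups
proto <- data.frame(Name = character(), Age = integer(), Address = character(),
                    stringsAsFactors = FALSE)

# One row per input string, one column per capture group
people <- strcapture(pattern, x, proto)
people
```

Because the prototype declares Age as integer, the captured digits are converted on the way in, which saves a separate as.integer() step.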
Hi,

You could do either of the following:

    Lines <- readLines(textConnection("Name: John Smith Age: 35 Address: 32, street, sub, something
    Name Adam Grey Age: 25 Address: 26, street, sub, something"))

    # Add the missing colon after "Name" on lines that lack it
    Lines[-grep("Name\\:", Lines)] <- gsub("Name", "Name:", Lines[-grep("Name\\:", Lines)])

    # Pull out each field with a backreference to its capture group
    Name <- gsub("Name\\: (.*) Age\\: (.*) Address\\: (.*)", "\\1", Lines)
    age <- gsub("Name\\: (.*) Age\\: (.*) Address\\: (.*)", "\\2", Lines)
    Address <- gsub("Name\\: (.*) Age\\: (.*) Address\\: (.*)", "\\3", Lines)
    dat1 <- data.frame(Name, age, Address, stringsAsFactors=FALSE)
    dat1
    #         Name age                    Address
    # 1 John Smith  35 32, street, sub, something
    # 2  Adam Grey  25 26, street, sub, something

    # or
    Lines[-grep("Name\\:", Lines)] <- gsub("Name", "Name:", Lines[-grep("Name\\:", Lines)])

    # Drop the labels and split the remainder on the colons
    res <- read.table(text=gsub("Name|Age|Address", "", Lines), sep=":", stringsAsFactors=FALSE)[-1]

    # Trim leading and trailing whitespace from the character columns
    res[sapply(res, is.character)] <- do.call(cbind, lapply(res[sapply(res, is.character)], function(x) sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x)))
    str(res)
    # 'data.frame':    2 obs. of  3 variables:
    #  $ V2: chr  "John Smith" "Adam Grey"
    #  $ V3: num  35 25
    #  $ V4: chr  "32, street, sub, something" "26, street, sub, something"

A.K.
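One caveat with the negative indexing above: x[-grep(...)] selects nothing when grep() finds no match at all, so if no line contained "Name:" none would be fixed. The variant below is a sketch of the second approach, not A.K.'s code. It uses logical indexing with grepl() instead and lets read.table() trim the fields via strip.white = TRUE. The names `noColon` and `res`, and the column names assigned at the end, are choices made for this example; it also assumes no field contains a colon or any of the three labels.

```r
Lines <- c(
  "Name: John Smith Age: 35 Address: 32, street, sub, something",
  "Name Adam Grey Age: 25 Address: 26, street, sub, something"
)

# Logical indexing stays correct even when every line (or no line) has the colon
noColon <- !grepl("^Name:", Lines)
Lines[noColon] <- sub("^Name", "Name:", Lines[noColon])

# Remove the labels, split on ":", and strip surrounding whitespace while reading;
# the first field is empty (text before the first colon), hence the [-1]
res <- read.table(text = gsub("Name|Age|Address", "", Lines),
                  sep = ":", strip.white = TRUE,
                  stringsAsFactors = FALSE)[-1]
names(res) <- c("Name", "Age", "Address")
res
```

With the whitespace handled at read time, the separate sapply()/sub() trimming step is no longer needed.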