Hi All,

I am quite new to text processing in R. I have found a lot of useful functions and I'm beginning to learn regex, but I need help with the following task: how do I calculate the distance between words?

That is, given a specific keyword, I need to assign labels to the other words based on their distance (number of words) from this keyword.

For example, if the keyword is "amet" and the string of words is:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit."

-> "dolor" would get a value of -2
-> "elit" would get a value of 3

If the sentence contains more than one instance of the keyword, I need values for each instance. Moreover, you can assume that I can split my data into sentences, so there is no need to search for and recognize sentences (this is a separate problem).

Thank you!

Best regards,
Jay
> On Nov 6, 2015, at 3:28 AM, Karl <josip.2000 at gmail.com> wrote:
>
> Hi All,
>
> Using R for text processing is quite new to me, while I have found a lot of
> useful functions and I'm beginning to learn regex, I need help with the
> following task. How do I calculate the distance between words?
>
> That is, given a specific keyword, I need to assign labels to the other
> words based on the distance (number of words) to this keyword.
>
> For example, if the keyword is "amet" and the string of words is

strng <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

> -> "dolor" would get a value of -2
> -> "elit" would get a value of 3

# Split on every non-word character; this leaves empty strings behind
words <- unlist(strsplit(strng, "\\W"))
words[words != ""]
#[1] "Lorem"       "ipsum"       "dolor"       "sit"
#[5] "amet"        "consectetur" "adipiscing"  "elit"

# Keep only the non-empty tokens
real <- words[words != ""]
which(real == "amet")
#[1] 5
length(real)
#[1] 8

# Signed offset of each word from the keyword
# (assumes the keyword occurs exactly once)
vec <- seq_along(real) - which(real == "amet")
names(vec) <- real
vec["dolor"]
#dolor
#   -2

> If the sentence contains more than one instance of the keyword, I need
> values for each instance. Moreover, one can assume that I can split my data
> into sentences, so there is no need to search and recognize sentences (this
> is a separate problem).
>
> Thank you!
>
> Best regards,
> Jay
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA
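
The code in the reply above covers a single occurrence of the keyword. For the multiple-instance case raised in the original question, one possible sketch is to build one row of signed offsets per occurrence with outer(). The helper name keyword_offsets is hypothetical (it does not come from the thread), and labelling the rows by keyword position is just one convention:

```r
# Hypothetical helper: signed word offsets from every occurrence of a keyword.
# Returns a matrix with one row per occurrence and one column per word.
keyword_offsets <- function(sentence, keyword) {
  # Split on runs of non-word characters and drop any empty tokens
  words <- unlist(strsplit(sentence, "\\W+"))
  words <- words[words != ""]

  # Positions at which the keyword occurs (exact, case-sensitive match)
  hits <- which(words == keyword)

  # d[k, j] = position of word j minus position of the k-th occurrence
  d <- outer(hits, seq_along(words), function(h, i) i - h)

  # Rows are labelled by keyword position, columns by the words themselves
  dimnames(d) <- list(as.character(hits), words)
  d
}

d <- keyword_offsets(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "amet")
d["5", "dolor"]  # -2
d["5", "elit"]   #  3
```

Because column names may repeat when a word occurs more than once, indexing by column position rather than by name is the safer way to read the result for such sentences.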
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Karl
> Subject: [R] Calculating distance between words in string
>
> .. given a specific keyword, I need to assign labels to the other words
> based on the distance (number of words) to this keyword.
>
> ...
> If the sentence contains more than one instance of the keyword, I need values
> for each instance.

What would you like to happen when the sentence contains more than one instance of other words and more than one instance of both?

e.g. what output do you want from
"amet is not the only instance of 'amet', and there is more than one instance of 'instance', 'is', 'of' and 'and'."

S Ellison
Perhaps what you are seeking is a sparse distance matrix.

"How far is each word from every other matching word"

sentence <- "How far is each word from every other matching word"
words <- tolower(unlist(strsplit(sentence, " ")))
nwords <- length(words)
wdm <- matrix(NA, nrow = nwords, ncol = nwords)
for(word in 1:nwords) {
 wordmatch <- grep(words[word], words, fixed = TRUE)
 wdm[word, wordmatch] <- wordmatch - word
}
rownames(wdm) <- colnames(wdm) <- words
wdm

The result contains zeros for a self-match, relative positions for the desired matches and NA for non-matches.

Jim

On Thu, Nov 12, 2015 at 12:15 AM, S Ellison <S.Ellison at lgcgroup.com> wrote:

> > -----Original Message-----
> > From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Karl
> > Subject: [R] Calculating distance between words in string
> >
> > .. given a specific keyword, I need to assign labels to the other words
> > based on the distance (number of words) to this keyword.
> >
> > ...
> > If the sentence contains more than one instance of the keyword, I need
> > values for each instance.
>
> What would you like to happen when the sentence contains more than one
> instance of other words and more than one instance of both?
>
> e.g. what output do you want from
> "amet is not the only instance of 'amet', and there is more than one
> instance of 'instance', 'is', 'of' and 'and'."
>
> S Ellison
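
One caveat with the loop above: grep(..., fixed = TRUE) matches substrings, so a short word such as "is" would also match "this". A possible variant, offered only as a sketch and not as Jim's method, compares whole words with == and builds the same kind of matrix without a loop. The example sentence is an assumption, chosen so that the substring case actually occurs:

```r
# Sketch: word-by-word offset matrix using exact (whole-word) matching.
# The sentence is illustrative; "this" contains "is" as a substring.
sentence <- "This is how far each word is from every other matching word"
words <- tolower(unlist(strsplit(sentence, " ", fixed = TRUE)))
pos <- seq_along(words)

# offsets[i, j] = j - i: how far word j is from word i
offsets <- outer(pos, pos, function(i, j) j - i)

# same[i, j] is TRUE only when words i and j are identical
same <- outer(words, words, "==")

# Keep offsets for matching words, NA everywhere else
wdm <- offsets
wdm[!same] <- NA
dimnames(wdm) <- list(words, words)
wdm
```

As in the original, the diagonal holds zeros for self-matches and non-matching pairs are NA, but the row for "is" no longer picks up "this".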