Sabina Arndt
2012-May-22 10:08 UTC
[R] How to remove square brackets, etc. from address strings?
Hello,

I'd like to remove the individual pairs of square brackets along with their content, plus the space directly after each pair, from address strings such as this:

  [Swidsinski, Alexander; Loening-Baucke, Vera; Lochs, Herbert] Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; [Hale, Laura P.] Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA

I'd like to get the following result:

  Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA

I tried

  address = gsub("(.*)[(.*)]", "\\2", address)

But this deletes everything from the first opening bracket to the last closing bracket and leaves only the very last address:

  Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA

How can I remove only the individual pairs of square brackets along with their content?

Thank you very much in advance!
Sarah Goslee
2012-May-22 12:39 UTC
[R] How to remove square brackets, etc. from address strings?
Hi Sabina,

You've run into two characteristics of regular expressions:

  [ ] are special characters
  * is a greedy match

Reading an intro regular expression document will help with both of those. Meanwhile:

> x <- "[Swidsinski, Alexander; Loening-Baucke, Vera; Lochs, Herbert] Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; [Hale, Laura P.] Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA"
> x
[1] "[Swidsinski, Alexander; Loening-Baucke, Vera; Lochs, Herbert] Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; [Hale, Laura P.] Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA"
> gsub("\\[.*?\\] ", "", x)  # escape [ and ] and make * lazy instead of greedy
[1] "Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA"

Sarah

--
Sarah Goslee
http://www.stringpage.com
http://www.sarahgoslee.com
http://www.functionaldiversity.org
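To make the two points about escaping and greediness concrete, here is a small sketch that runs a greedy pattern, the lazy pattern, and a negated character class on the example string from this thread. The variable names `greedy`, `lazy`, and `negated` are only illustrative, and the negated-class pattern is an alternative that was not suggested in the thread itself.

```r
x <- paste0(
  "[Swidsinski, Alexander; Loening-Baucke, Vera; Lochs, Herbert] ",
  "Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; ",
  "[Hale, Laura P.] Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA"
)

# Greedy: .* runs from the first "[" to the LAST "] ",
# so everything before the final address is removed.
greedy <- gsub("\\[.*\\] ", "", x)

# Lazy: .*? stops at the FIRST "] " after each "[",
# so each bracketed group is removed on its own.
lazy <- gsub("\\[.*?\\] ", "", x)

# Alternative without a lazy quantifier: match any run of
# characters that are not "]" between the escaped brackets.
negated <- gsub("\\[[^]]*\\] ", "", x)

print(greedy)
print(lazy)
print(identical(lazy, negated))
```

The greedy version reproduces the symptom described in the original question (only the last address survives), while the lazy and negated-class versions should both return the two addresses without their author lists.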
Hi,

text <- "[Swidsinski, Alexander; Loening-Baucke, Vera; Lochs, Herbert] Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; [Hale, Laura P.] Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA"
gsub("\\[.+?]", "", text)

A.K.
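A note on the space after each bracket, which the original question also asked to remove: a pattern that stops at the closing bracket leaves that space in place. The sketch below compares the two behaviours on the same example string. The names `brackets_only` and `with_space` are hypothetical, and the optional-space pattern is one possible adjustment, not something posted in the thread.

```r
text <- paste0(
  "[Swidsinski, Alexander; Loening-Baucke, Vera; Lochs, Herbert] ",
  "Charite Humboldt Univ, Innere Klin, D-10098 Berlin, Germany; ",
  "[Hale, Laura P.] Duke Univ, Med Ctr, Dept Pathol, Durham, NC 27710 USA"
)

# Pattern ending at the closing bracket: the space that followed
# each bracketed group stays in the result.
brackets_only <- gsub("\\[.+?]", "", text)

# Same idea with an optional trailing space (" ?") included in the match,
# so the space after each group is removed as well.
with_space <- gsub("\\[.+?\\] ?", "", text)

print(brackets_only)
print(with_space)
```

With the first pattern the result begins with a space and has two spaces after the semicolon; with the second it should match the output requested in the question.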
Sabina Arndt
2012-May-25 20:31 UTC
[R] How to remove square brackets, etc. from address strings?
Hello r-help members,

the solutions which Sarah Goslee and arun sent to me in such a prompt and helpful manner work well with the examples I cut from the data.frame I'm analyzing. Thank you very much for that!

I incorporated them into my R script and discovered that, unfortunately, it still doesn't work properly. I have no idea why that's the case.

You see, I want to extract country names from the contents of tab-delimited text files.

This is an example of the data I'm using:
http://pastebin.com/mYZNDXg6

This is the script I'm using to import the data:
http://pastebin.com/Z10UUH3z
(It requires the text files to be in a folder which doesn't contain any other .txt files.)

This is the script I'm using to extract the country names:
http://pastebin.com/G37fuPba

This is the string that's in the relevant field of the first record I'm working on:

  [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany

This is the incorrect result my extraction script gives me for the first record:

> C1s[1]
 [1] "[ENGEL, KATHRIN M. Y." "KRISTIN"               "TORSTEN"
 [4] "GERMANY"               "DANIEL"                "LESCA MIRIAM"
 [7] "GERMANY"               "ANKE"                  "MATTHIAS"
[10] "MATTHIAS"              "GERMANY"               "KERSTIN"
[13] "GERMANY"               "GERMANY"               "[SCHEIDT, HOLGER A."
[16] "JUERGEN"               "GERMANY"               "HUMBOLDT"
[19] "GERMANY"

For some reason the first and sixth of the eight pairs of square brackets are not removed ... Do you understand why?

Instead, I'd like to get this result:

> C1s[1]
[1] "GERMANY"  "GERMANY"  "GERMANY"
[4] "GERMANY"  "GERMANY"  "GERMANY"
[7] "HUMBOLDT" "GERMANY"

What am I doing wrong? What are the errors in my R script? Would anybody be so kind as to take a look and help me out, please?

Thank you very much in advance!

Faithfully yours,

Sabina Arndt
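One thing worth checking, offered only as a guess because the linked scripts are not reproduced in this thread: each bracketed author list itself contains "; ", so splitting the field on semicolons before the brackets are removed separates an opening bracket from its closing bracket. Below is a minimal sketch that strips the bracketed groups first and splits afterwards. The names `c1`, `split_first`, `no_authors`, `addresses`, and `countries` are hypothetical, the sample is an excerpt (three of the eight addresses) from the record quoted above, and treating the last comma-separated field as the country is an assumption.

```r
# Excerpt of the first record: three of its eight addresses.
c1 <- paste0(
  "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] ",
  "Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; ",
  "[Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; ",
  "[Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
)

# Splitting first cuts through the author lists:
# 3 addresses turn into 6 pieces, some with unmatched brackets.
split_first <- strsplit(c1, "; ", fixed = TRUE)[[1]]
print(length(split_first))

# Step 1: remove each "[...]" group plus one optional trailing space.
no_authors <- gsub("\\[[^]]*\\] ?", "", c1)

# Step 2: now "; " only separates addresses.
addresses <- strsplit(no_authors, "; ", fixed = TRUE)[[1]]

# Step 3 (assumption): the country is the last comma-separated field.
countries <- toupper(sub(".*, ", "", addresses))
print(countries)
```

If the real script splits before removing the brackets, reordering those two steps may be enough. Addresses whose last field is not a bare country name, such as "NC 27710 USA" in the earlier example, would need extra handling that this sketch does not attempt.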