Dear R community,

I have the following problem that I hoped you could help me with.

My data is saved in thousands of files with an unusual extension consisting of four digits and a z, for example *.1405z. With list.files I managed to load this data into R. It looks like this (the row numbers are not in the original file):

35 :LATEST STAGE 3.60 FT AT 730 AM CST ON 0102
36 .ER ARCT2 0102 C DC200001020813/DH12/HGIFF/DIH6
37 :QPF FORECAST 6AM NOON 6PM MDNT
38 .E1 :0102: / 3.5/ 3.4/ 3.5
39 .E2 :0103: / 3.5/ 3.0/ 2.5/ 2.1
40 .E3 :0104: / 1.8/ 1.5/ 1.3/ 1.2
41 .E4 :0105: / 1.2/ 1.8/ 2.3/ 2.7
42 .E5 :0106: / 3.0/ 3.0/ 3.1/ 3.3
43 .E6 :0107: / 3.4

I need the table in rows 37 to 43 in a matrix, for example:

0102   NA  3.5  3.4  3.5
0103  3.5  3.0  2.5  2.1
0104  1.8  1.5  1.3  1.2
0105  1.2  1.8  2.3  2.7
0106  3.0  3.0  3.1  3.3
0107  3.4   NA   NA   NA

Unfortunately the row numbers vary per file. I can call up each line with file[40,1], for line 40 for example. It returns:

[1] .E3 :0104: / 1.8/ 1.5/ 1.3/ 1.2
38 Levels: .E1 :0102: / 3.5/ 3.4/ 3.5 ...

So I have two problems really:
1. How do I detect the table in the file (that is, the line where the table starts)?
2. How do I break up each line to write the values into a matrix?

Feel free to suggest an entirely different approach if you think that is helpful.

Thanks a lot!
Frauke

--
View this message in context: http://r.789695.n4.nabble.com/extracting-data-from-unstructured-text-file-tp4464423p4464423.html
Sent from the R help mailing list archive at Nabble.com.
Can you at least provide a subset of two files so we can see how the data is really stored in the file and what the separators are between the 'columns' of data? Also, how do you determine where the data actually starts for the rows that you want to pull off? This will aid in determining how to parse the data.

On Sun, Mar 11, 2012 at 3:07 PM, frauke <fhoss at andrew.cmu.edu> wrote:
> So I have two problems really:
> 1. How do I detect the table in the file (resp. the line where the table
> starts)?
> 2. How do I break up each line to write the values into a matrix?

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
Hi Frauke,

Try Unix commands with R's system() function.

Example: let's say you have a matrix like this in a file called hello.txt (note: the first element is missing):

     10  100
2    20  200
3    30  300
4    40  400
5    50  500

You can try something like:

    hello <- system("cut -f1 hello.txt", intern = TRUE)

Note that cut splits on tabs by default, so this assumes the columns are tab-separated.

VP.

On 11 March 2012 19:07, frauke <fhoss@andrew.cmu.edu> wrote:
> Feel free to suggest an entirely different approach if you think that is
> helpful.
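The column extraction suggested above can also be sketched without leaving R. This is only an illustration, assuming hello.txt is tab-delimited as cut expects by default; the temporary file and the name first_col are made up for the example. One difference worth noting: system("cut ...", intern = TRUE) returns character strings (an empty string for the missing element), whereas read.delim() returns numbers with NA.

```r
# Recreate the hello.txt example in a temporary file (tab-separated,
# first element of the first row missing).
tmp <- tempfile(fileext = ".txt")
writeLines(c("\t10\t100",
             "2\t20\t200",
             "3\t30\t300",
             "4\t40\t400",
             "5\t50\t500"), tmp)

# read.delim() splits on tabs; header = FALSE because the file has no header row.
hello <- read.delim(tmp, header = FALSE)

# First column, the counterpart of `cut -f1`; the blank field becomes NA.
first_col <- hello[[1]]
print(first_col)
```

This keeps the workflow portable to systems where cut is not available, at the cost of reading the whole file rather than a single column.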
Thank you for the quick reply! I have attached two files.

sample1.1339z: http://r.789695.n4.nabble.com/file/n4464511/sample1.1339z
sample2.1949z: http://r.789695.n4.nabble.com/file/n4464511/sample2.1949z
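For reference, the two questions in the original post (finding the table and splitting its lines) could be approached roughly as follows. This is a minimal sketch, not a tested solution for the attached files: it assumes every table row starts with ".E" followed by a number and a ":MMDD:" date, that values are separated by "/", and that a short first row is missing its leading values while later short rows are missing trailing values (as in the desired matrix above). The function name extract_forecast and the argument n_col are hypothetical.

```r
# Hypothetical helper: pull the .E forecast rows out of the lines of one file.
extract_forecast <- function(lines, n_col = 4) {
  # Table rows look like ".E3 :0104: / 1.8/ 1.5/ 1.3/ 1.2" (leading blanks allowed).
  pattern <- "^[[:space:]]*\\.E[0-9]+[[:space:]]+:([0-9]{4}):[[:space:]]*/(.*)$"
  rows <- grep(pattern, lines, value = TRUE)
  if (length(rows) == 0) return(NULL)

  dates  <- sub(pattern, "\\1", rows)                    # the MMDD label
  fields <- strsplit(sub(pattern, "\\2", rows), "/", fixed = TRUE)

  out <- matrix(NA_real_, nrow = length(rows), ncol = n_col,
                dimnames = list(dates, NULL))
  for (i in seq_along(fields)) {
    vals <- as.numeric(trimws(fields[[i]]))
    vals <- vals[!is.na(vals)]                           # drop empty fields
    n <- min(length(vals), n_col)
    if (n == 0) next
    # ASSUMPTION: a short first row is right-aligned (forecast starts mid-day);
    # any other short row is left-aligned (forecast ends mid-day).
    idx <- if (i == 1) (n_col - n + 1):n_col else seq_len(n)
    out[i, idx] <- vals[seq_len(n)]
  }
  out
}

# Lines copied from the example in the original post.
sample_lines <- c(
  ":LATEST STAGE 3.60 FT AT 730 AM CST ON 0102",
  ".ER ARCT2 0102 C DC200001020813/DH12/HGIFF/DIH6",
  ":QPF FORECAST 6AM NOON 6PM MDNT",
  ".E1 :0102: / 3.5/ 3.4/ 3.5",
  ".E2 :0103: / 3.5/ 3.0/ 2.5/ 2.1",
  ".E3 :0104: / 1.8/ 1.5/ 1.3/ 1.2",
  ".E4 :0105: / 1.2/ 1.8/ 2.3/ 2.7",
  ".E5 :0106: / 3.0/ 3.0/ 3.1/ 3.3",
  ".E6 :0107: / 3.4"
)
forecast <- extract_forecast(sample_lines)
print(forecast)
```

To apply this to real files, each one would be read with readLines(), which returns plain character strings rather than the factor seen in the file[40,1] output, for example lapply(list.files(pattern = "\\.[0-9]{4}z$"), function(f) extract_forecast(readLines(f))). Matching on the row pattern instead of on line numbers is what makes the varying table position irrelevant.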