If I have a web log file as follows:

#Software: Microsoft Internet Information Services 5.0
#Version: 1.0
#Date: 2007-12-03 13:50:17
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
cs(Cookie) cs(Referer)
"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/nada.gif - 200 328 447 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
http://www.teatro.com/localidades/localidades.asp"
"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/cargando.gif - 200 1150 451 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
http://www.teatro.com/localidades/localidades.asp"
"2007-12-03 13:50:18 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/cerrar.png - 200 450 449 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)

how can I turn it into a data frame with 3 rows and 16 columns named
date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
cs(Cookie) cs(Referer), skipping lines beginning with #?

Thanks,

Sebastián.

Here is a way to do it. I assume that your data has each record on a
line; it came through the email as multiple lines.

> x <- readLines("/tempxx.txt")
> # remove "#Fields: " so it can be used as a header
> x <- sub("^#Fields: ", "", x)
> # remove comment lines
> x <- x[-grep("^#", x)]
> # remove quotes
> x <- gsub('"', '', x)
> # now read in the data
> input <- read.table(textConnection(x), header=TRUE)
>
> str(input)
'data.frame':   2 obs. of  16 variables:
 $ date          : Factor w/ 1 level "2007-12-03": 1 1
 $ time          : Factor w/ 1 level "13:50:17": 1 1
 $ c.ip          : Factor w/ 1 level "200.40.203.197": 1 1
 $ cs.username   : Factor w/ 1 level "-": 1 1
 $ s.ip          : Factor w/ 1 level "200.40.51.20": 1 1
 $ s.port        : int  80 80
 $ cs.method     : Factor w/ 1 level "GET": 1 1
 $ cs.uri.stem   : Factor w/ 2 levels "/localidades/img/cargando.gif",..: 2 1
 $ cs.uri.query  : Factor w/ 1 level "-": 1 1
 $ sc.status     : int  200 200
 $ sc.bytes      : int  328 1150
 $ cs.bytes      : int  447 451
 $ time.taken    : int  0 0
 $ cs.User.Agent.: Factor w/ 1 level "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)": 1 1
 $ cs.Cookie.    : Factor w/ 1 level "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA": 1 1
 $ cs.Referer.   : Factor w/ 1 level "http://www.teatro.com/localidades/localidades.asp": 1 1
>

On Tue, Sep 22, 2009 at 9:51 PM, Sebastian Kruk <residuo.solow at gmail.com> wrote:
> If I have a web log file as follows:
>
> #Software: Microsoft Internet Information Services 5.0
> #Version: 1.0
> #Date: 2007-12-03 13:50:17
> #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
> cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
> cs(Cookie) cs(Referer)
> "2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
> /localidades/img/nada.gif - 200 328 447 0
> Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
> ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
> http://www.teatro.com/localidades/localidades.asp"
> "2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
> /localidades/img/cargando.gif - 200 1150 451 0
> Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
> ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
> http://www.teatro.com/localidades/localidades.asp"
> "2007-12-03 13:50:18 200.40.203.197 - 200.40.51.20 80 GET
> /localidades/img/cerrar.png - 200 450 449 0
> Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
>
> how can I turn it into a data frame with 3 rows and 16 columns named
> date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
> cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
> cs(Cookie) cs(Referer), skipping lines beginning with #?
>
> Thanks,
>
> Sebastián.

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

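For anyone who wants the readLines()/read.table() idea above in reusable form, here is a rough sketch, not a tested solution. The function name read_iis_log() is made up for illustration. Several choices are my own assumptions rather than anything stated in the thread: fill = TRUE (so the truncated third record is padded instead of causing an error), check.names = FALSE (so names such as c-ip are kept as given in #Fields:), and building the sample lines with paste().

```r
# Hypothetical helper (not from the thread): parse W3C-style log lines
# whose column names are given by a "#Fields:" directive.
read_iis_log <- function(lines) {
  fields_line <- grep("^#Fields: ", lines, value = TRUE)
  if (length(fields_line) == 0) stop("no '#Fields:' directive found")
  col_names <- strsplit(sub("^#Fields: ", "", fields_line[1]), " +")[[1]]

  # Drop directive lines; unlike x[-grep(...)], this is safe with no matches.
  records <- lines[!grepl("^#", lines)]
  # Strip the surrounding double quotes.
  records <- gsub('"', "", records, fixed = TRUE)

  read.table(text = records, header = FALSE, col.names = col_names,
             check.names = FALSE,       # keep names such as "c-ip" unchanged
             fill = TRUE,               # pad records that have fewer fields
             quote = "", comment.char = "",
             stringsAsFactors = FALSE)
}

# The sample from the original post, one record per element.
ua      <- "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)"
cookie  <- "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA"
referer <- "http://www.teatro.com/localidades/localidades.asp"
host    <- "200.40.203.197 - 200.40.51.20 80 GET"

log_lines <- c(
  "#Software: Microsoft Internet Information Services 5.0",
  "#Version: 1.0",
  "#Date: 2007-12-03 13:50:17",
  paste("#Fields: date time c-ip cs-username s-ip s-port cs-method",
        "cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken",
        "cs(User-Agent) cs(Cookie) cs(Referer)"),
  paste0('"', paste("2007-12-03 13:50:17", host,
                    "/localidades/img/nada.gif - 200 328 447 0",
                    ua, cookie, referer), '"'),
  paste0('"', paste("2007-12-03 13:50:17", host,
                    "/localidades/img/cargando.gif - 200 1150 451 0",
                    ua, cookie, referer), '"'),
  # The third record is truncated in the post (no cookie, referer, or
  # closing quote), so it is kept that way here.
  paste0('"', paste("2007-12-03 13:50:18", host,
                    "/localidades/img/cerrar.png - 200 450 449 0", ua))
)

log_df <- read_iis_log(log_lines)
str(log_df)
```

With a real file, the call would presumably be read_iis_log(readLines("/tempxx.txt")). Because the names are not syntactic, columns have to be accessed as log_df[["sc-bytes"]] or with backticks.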
Sebastian,
There is rarely a completely free lunch, but fortunately for us R has
some wonderful tools to make this possible.  R supports regular
expressions with commands like grep(), gsub(), strsplit(), and others
documented on the help pages.  It's just a matter of constructing an
algorithm that does the job.  In your case, for example (though please
note there are probably many different, completely reasonable
approaches in R):
x <- scan("logfilename", what="", sep="\n")
should give you a vector of character strings, one line per element.  Now, lines
containing "GET" seem to identify interesting lines, so
x <- x[grep("GET", x)]
should trim it to only the interesting lines.  If you want information
from other lines, you'll have to treat them separately.  Next, you
might try
y <- strsplit(x, "[[:space:]]+")
which splits each line on runs of whitespace (strsplit() has no default
and needs the split pattern spelled out), returning a list (one
component per line) of vectors based on the split.  Try it.  If it
looks good, you might check
lapply(y, length)
to see if all lines contain the same number of fields.  If so, you can
then get quickly into a matrix,
z <- matrix(unlist(strsplit(x, "[[:space:]]+")), ncol=K, byrow=TRUE)
where K is the common length you just observed.  If you think this is
cool, great!  If not, well... hire a programmer, or if you're lucky
Microsoft or Apache have tools to help you with this.  There might be
something in the Perl/Python world.  Or maybe there's a package in R
designed just for this, but I encourage students to develop the raw
skills...
Jay
-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
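
The grep()/strsplit()/matrix() steps described above could be strung together roughly as follows. This is only a sketch under my own assumptions, not part of the original message: the character vector x stands in for the result of scan("logfilename", what="", sep="\n"), the column names are taken from the #Fields: line, the quotes are stripped first, and short records are padded with NA (the third record in the post stops after the user agent). The list of integer columns is a guess based on the field names.

```r
# Pieces of the sample records from the original post.
ua      <- "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)"
cookie  <- "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA"
referer <- "http://www.teatro.com/localidades/localidades.asp"
host    <- "200.40.203.197 - 200.40.51.20 80 GET"

# Stand-in for: x <- scan("logfilename", what="", sep="\n")
x <- c(
  "#Software: Microsoft Internet Information Services 5.0",
  paste("#Fields: date time c-ip cs-username s-ip s-port cs-method",
        "cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken",
        "cs(User-Agent) cs(Cookie) cs(Referer)"),
  paste0('"', paste("2007-12-03 13:50:17", host,
                    "/localidades/img/nada.gif - 200 328 447 0",
                    ua, cookie, referer), '"'),
  paste0('"', paste("2007-12-03 13:50:17", host,
                    "/localidades/img/cargando.gif - 200 1150 451 0",
                    ua, cookie, referer), '"'),
  paste0('"', paste("2007-12-03 13:50:18", host,
                    "/localidades/img/cerrar.png - 200 450 449 0", ua))
)

# Column names from the "#Fields:" directive.
fields <- strsplit(sub("^#Fields: ", "", grep("^#Fields: ", x, value = TRUE)[1]),
                   " ", fixed = TRUE)[[1]]
K <- length(fields)

# Keep the request lines, drop the quotes, split on whitespace.
recs <- gsub('"', "", x[grep("GET", x)], fixed = TRUE)
y <- strsplit(recs, "[[:space:]]+")

# Check the number of fields per line before building the matrix.
n <- sapply(y, length)
stopifnot(all(n <= K))

# Assumption: pad short records with NA so that every row has K entries.
y <- lapply(y, function(v) c(v, rep(NA_character_, K - length(v))))

z <- matrix(unlist(y), ncol = K, byrow = TRUE)
log_df <- as.data.frame(z, stringsAsFactors = FALSE)
names(log_df) <- fields

# Everything is character at this point; convert the numeric-looking columns.
int_cols <- c("s-port", "sc-status", "sc-bytes", "cs-bytes", "time-taken")
log_df[int_cols] <- lapply(log_df[int_cols], as.integer)
```

Padding is one way to handle ragged records; dropping incomplete lines would be another reasonable choice, depending on what the analysis needs.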