If I have a web log file as follows:

#Software: Microsoft Internet Information Services 5.0
#Version: 1.0
#Date: 2007-12-03 13:50:17
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
cs(Cookie) cs(Referer)
"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/nada.gif - 200 328 447 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
http://www.teatro.com/localidades/localidades.asp"
"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/cargando.gif - 200 1150 451 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
http://www.teatro.com/localidades/localidades.asp"
"2007-12-03 13:50:18 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/cerrar.png - 200 450 449 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)

how can I turn it into a data frame with 3 rows and 16 columns named
date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
cs(Cookie) cs(Referer), skipping lines beginning with #?

Thanks,

Sebastián.
Here is a way to do it. I assume that your data has each record on a single line; it came through the email as multiple lines.

> x <- readLines("/tempxx.txt")
> # remove '#Fields: ' so it can be used as a header
> x <- sub("^#Fields: ", "", x)
> # remove comment lines
> x <- x[-grep("^#", x)]
> # remove quotes
> x <- gsub('"', '', x)
> # now read in the data
> input <- read.table(textConnection(x), header=TRUE)
>
> str(input)
'data.frame':   2 obs. of  16 variables:
 $ date          : Factor w/ 1 level "2007-12-03": 1 1
 $ time          : Factor w/ 1 level "13:50:17": 1 1
 $ c.ip          : Factor w/ 1 level "200.40.203.197": 1 1
 $ cs.username   : Factor w/ 1 level "-": 1 1
 $ s.ip          : Factor w/ 1 level "200.40.51.20": 1 1
 $ s.port        : int  80 80
 $ cs.method     : Factor w/ 1 level "GET": 1 1
 $ cs.uri.stem   : Factor w/ 2 levels "/localidades/img/cargando.gif",..: 2 1
 $ cs.uri.query  : Factor w/ 1 level "-": 1 1
 $ sc.status     : int  200 200
 $ sc.bytes      : int  328 1150
 $ cs.bytes      : int  447 451
 $ time.taken    : int  0 0
 $ cs.User.Agent.: Factor w/ 1 level "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)": 1 1
 $ cs.Cookie.    : Factor w/ 1 level "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA": 1 1
 $ cs.Referer.   : Factor w/ 1 level "http://www.teatro.com/localidades/localidades.asp": 1 1
>

On Tue, Sep 22, 2009 at 9:51 PM, Sebastian Kruk <residuo.solow at gmail.com> wrote:
> [...]

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
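The readLines / sub / grep / gsub / read.table steps in Jim's reply can be collected into one small function. What follows is only a sketch under stated assumptions: the name read_iis_log and the inline sample are made up for illustration, and it assumes one record per line with no whitespace inside any field. It swaps x[-grep(...)] for grepl(), so that an input with no remaining comment lines is not emptied, and passes stringsAsFactors = FALSE so text columns stay character rather than becoming factors.

```r
# Hypothetical helper; the name read_iis_log is illustrative, not from the thread.
read_iis_log <- function(lines) {
  # Turn the "#Fields:" directive into a plain header line
  lines <- sub("^#Fields: ", "", lines)
  # Drop the remaining comment lines; grepl() is safe when none match
  lines <- lines[!grepl("^#", lines)]
  # Remove the double quotes wrapping each record
  lines <- gsub('"', "", lines, fixed = TRUE)
  # comment.char = "" keeps a '#' inside a URL from truncating a record
  read.table(text = lines, header = TRUE,
             stringsAsFactors = FALSE, comment.char = "")
}

# Trailing fields shared by the two complete records in the question
tail_fields <- paste(
  "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)",
  "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA",
  "http://www.teatro.com/localidades/localidades.asp"
)

# Inline sample standing in for readLines() on the real log file
sample_log <- c(
  "#Software: Microsoft Internet Information Services 5.0",
  "#Version: 1.0",
  "#Date: 2007-12-03 13:50:17",
  paste("#Fields: date time c-ip cs-username s-ip s-port cs-method",
        "cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken",
        "cs(User-Agent) cs(Cookie) cs(Referer)"),
  paste0('"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET ',
         '/localidades/img/nada.gif - 200 328 447 0 ', tail_fields, '"'),
  paste0('"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET ',
         '/localidades/img/cargando.gif - 200 1150 451 0 ', tail_fields, '"')
)

log_df <- read_iis_log(sample_log)
str(log_df)
```

Note that the third record as posted in the question has only 14 fields (no cookie or referer), so read.table() would stop with an error on it unless fill = TRUE is added.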
Sebastian,

There is rarely a completely free lunch, but fortunately for us R has some wonderful tools to make this possible. R supports regular expressions with commands like grep(), gsub(), strsplit(), and others documented on the help pages. It's just a matter of constructing an algorithm that does the job. In your case, for example (though please note there are probably many different, completely reasonable approaches in R):

x <- scan("logfilename", what="", sep="\n")

should give you a vector of character strings, one line per element. Now, lines containing "GET" seem to identify interesting lines, so

x <- x[grep("GET", x)]

should trim it to only the interesting lines. If you want information from other lines, you'll have to treat them separately. Next, you might try

y <- strsplit(x, "[[:space:]]+")

which splits on runs of whitespace (strsplit() has no default split, so one must be supplied), returning a list (one component per line) of vectors based on the split. Try it. If it looks good, you might check

lapply(y, length)

to see if all lines contain the same number of fields. If so, you can then get quickly into a matrix,

z <- matrix(unlist(y), ncol=K, byrow=TRUE)

where K is the common length you just observed.

If you think this is cool, great! If not, well... hire a programmer, or if you're lucky Microsoft or Apache have tools to help you with this. There might be something in the Perl/Python world. Or maybe there's a package in R designed just for this, but I encourage students to develop the raw skills...

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
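For reference, the grep / strsplit / matrix route described in Jay's reply might look like the sketch below. It rests on a few assumptions that are not in the thread: an inline sample stands in for the scan() call, the wrapping double quotes are stripped before splitting (scan() with sep="\n" leaves them in place), and the column names are taken from the "#Fields:" line. Every column comes out as character, so numeric fields need converting afterwards.

```r
# Trailing fields shared by the two complete records in the question
tail_fields <- paste(
  "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)",
  "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA",
  "http://www.teatro.com/localidades/localidades.asp"
)

# Inline sample standing in for scan("logfilename", what = "", sep = "\n")
x <- c(
  "#Software: Microsoft Internet Information Services 5.0",
  paste("#Fields: date time c-ip cs-username s-ip s-port cs-method",
        "cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken",
        "cs(User-Agent) cs(Cookie) cs(Referer)"),
  paste0('"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET ',
         '/localidades/img/nada.gif - 200 328 447 0 ', tail_fields, '"'),
  paste0('"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET ',
         '/localidades/img/cargando.gif - 200 1150 451 0 ', tail_fields, '"')
)

# Column names come from the first "#Fields:" directive
fields_line <- sub("^#Fields: ", "", x[grep("^#Fields:", x)][1])
fields <- strsplit(fields_line, " ", fixed = TRUE)[[1]]

# Keep the request lines and strip the surrounding double quotes
recs <- gsub('"', "", x[grep("GET", x)], fixed = TRUE)

# Split each record on runs of whitespace: one character vector per line
y <- strsplit(recs, "[[:space:]]+")

# Every record should have the same number of fields as the header
K <- unique(vapply(y, length, integer(1)))
stopifnot(length(K) == 1, K == length(fields))

# Fill a matrix row by row, then convert to a data frame
z <- matrix(unlist(y), ncol = K, byrow = TRUE)
log_df <- as.data.frame(z, stringsAsFactors = FALSE)
names(log_df) <- fields

# Columns are all character here; convert numeric ones as needed
log_df[["sc-bytes"]] <- as.integer(log_df[["sc-bytes"]])
```

Unlike read.table(), this route keeps the original field names such as cs(User-Agent) unchanged, at the cost of having to convert column types by hand.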