thr3ads.net - R help - [R] reading web log file into R [Sep 2009]

Sebastian Kruk

2009-Sep-23 01:51 UTC

[R] reading web log file into R

If I have a web log file as follows:

#Software: Microsoft Internet Information Services 5.0
#Version: 1.0
#Date: 2007-12-03 13:50:17
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
cs(Cookie) cs(Referer)
"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/nada.gif - 200 328 447 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
http://www.teatro.com/localidades/localidades.asp"
"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/cargando.gif - 200 1150 451 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA
http://www.teatro.com/localidades/localidades.asp"
"2007-12-03 13:50:18 200.40.203.197 - 200.40.51.20 80 GET
/localidades/img/cerrar.png - 200 450 449 0
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)

How can I turn it into a data frame with 3 rows and 16 columns named
date time c-ip cs-username s-ip s-port cs-method cs-uri-stem
cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent)
cs(Cookie) cs(Referer), skipping lines beginning with #?

Thanks,

Sebastián.

jim holtman

2009-Sep-23 12:22 UTC


Here is a way to do it.  I assume that your data has each record on a
single line; it came through the email wrapped over multiple lines.

> x <- readLines("/tempxx.txt")
> # strip the "#Fields: " prefix so that line can be used as a header
> x <- sub("^#Fields: ", "", x)
> # remove the remaining comment lines; unlike x[-grep(...)], this
> # keeps everything when no line starts with '#'
> x <- x[!grepl("^#", x)]
> # remove quotes
> x <- gsub('"', '', x, fixed=TRUE)
> # now read in the data
> input <- read.table(textConnection(x), header=TRUE)
>
> str(input)
'data.frame':   2 obs. of  16 variables:
 $ date          : Factor w/ 1 level "2007-12-03": 1 1
 $ time          : Factor w/ 1 level "13:50:17": 1 1
 $ c.ip          : Factor w/ 1 level "200.40.203.197": 1 1
 $ cs.username   : Factor w/ 1 level "-": 1 1
 $ s.ip          : Factor w/ 1 level "200.40.51.20": 1 1
 $ s.port        : int  80 80
 $ cs.method     : Factor w/ 1 level "GET": 1 1
 $ cs.uri.stem   : Factor w/ 2 levels "/localidades/img/cargando.gif",..: 2 1
 $ cs.uri.query  : Factor w/ 1 level "-": 1 1
 $ sc.status     : int  200 200
 $ sc.bytes      : int  328 1150
 $ cs.bytes      : int  447 451
 $ time.taken    : int  0 0
 $ cs.User.Agent.: Factor w/ 1 level "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)": 1 1
 $ cs.Cookie.    : Factor w/ 1 level "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA": 1 1
 $ cs.Referer.   : Factor w/ 1 level "http://www.teatro.com/localidades/localidades.asp": 1 1
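
For readers who want to try the same idea without a file on disk, here is a minimal, self-contained sketch. It is not Jim's code: the helper name read_iis_log is hypothetical, the sample holds only the two complete records from the post, and it assumes each record sits on a single line. It also assumes you want the original field names (c-ip, cs(User-Agent), ...) kept verbatim, which is what check.names = FALSE is for, and character columns rather than factors.

```r
# Two complete records from the post, one record per line (an assumption).
log_lines <- c(
  "#Software: Microsoft Internet Information Services 5.0",
  "#Version: 1.0",
  "#Date: 2007-12-03 13:50:17",
  paste("#Fields: date time c-ip cs-username s-ip s-port cs-method",
        "cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken",
        "cs(User-Agent) cs(Cookie) cs(Referer)"),
  paste('"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET',
        "/localidades/img/nada.gif - 200 328 447 0",
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)",
        "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA",
        'http://www.teatro.com/localidades/localidades.asp"'),
  paste('"2007-12-03 13:50:17 200.40.203.197 - 200.40.51.20 80 GET',
        "/localidades/img/cargando.gif - 200 1150 451 0",
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)",
        "ASPSESSIONIDSQCBSQAB=JOLECDCCBFCKPOFLGDLHMENA",
        'http://www.teatro.com/localidades/localidades.asp"')
)

# Hypothetical helper: turn the lines of an IIS-style log into a data frame.
read_iis_log <- function(lines) {
  # The "#Fields:" directive supplies the column names.
  fields_line <- grep("^#Fields: ", lines, value = TRUE)[1]
  col_names <- strsplit(sub("^#Fields: ", "", fields_line), " ", fixed = TRUE)[[1]]

  # Drop every directive line, then strip the surrounding quotes.
  records <- lines[!grepl("^#", lines)]
  records <- gsub('"', "", records, fixed = TRUE)

  # check.names = FALSE keeps names such as "c-ip" and "cs(Referer)" as-is;
  # comment.char = "" stops a '#' inside a URL from truncating a record.
  read.table(text = records, col.names = col_names, check.names = FALSE,
             stringsAsFactors = FALSE, comment.char = "", quote = "")
}

log_df <- read_iis_log(log_lines)
str(log_df)
```

Because the names are not syntactic, columns have to be accessed as log_df[["sc-bytes"]] or log_df$`sc-bytes`. Leave check.names at its default if you prefer the dotted names shown in the output above.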



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

Jay Emerson

2009-Sep-23 12:32 UTC


Sebastian,

There is rarely a completely free lunch, but fortunately for us R has
some wonderful tools to make this possible.  R supports regular
expressions with commands like grep(), gsub(), strsplit(), and others
documented on the help pages.  It's just a matter of constructing an
algorithm that does the job.  In your case, for example (though please
note there are probably many different, completely reasonable
approaches in R):

x <- scan("logfilename", what="", sep="\n")

should give you a vector of character strings, one line per element.  Now, lines
containing "GET" seem to identify interesting lines, so

x <- x[grep("GET", x)]

should trim it to only the interesting lines.  If you want information
from other lines, you'll have to treat them separately.  Next, you
might try

y <- strsplit(x, "[[:space:]]+")

which splits on runs of whitespace (strsplit() has no default for its
split argument, so the pattern must be given), returning a list (one
component per line) of vectors based on the split.  Try it.  If it
looks good, you might check

lapply(y, length)

to see if all lines contain the same number of fields.  If so, you
can then get quickly into a matrix,

z <- matrix(unlist(strsplit(x, "[[:space:]]+")), ncol=K, byrow=TRUE)

where K is the common length you just observed.  If you think this is
cool, great!  If not, well... hire a programmer, or if you're lucky
Microsoft or Apache have tools to help you with this.  There might be
something in the Perl/Python world.  Or maybe there's a package in R
designed just for this, but I encourage students to develop the raw
skills...
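
The steps above can be strung together roughly as follows. This is only a sketch under stated assumptions: the log is inlined and trimmed to six of the original fields for brevity, every record is on one line, and names such as requests and field_names are made up for the illustration.

```r
# Inline sample with an abbreviated field list (an assumption for brevity).
x <- c(
  "#Version: 1.0",
  "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
  "2007-12-03 13:50:17 200.40.203.197 GET /localidades/img/nada.gif 200",
  "2007-12-03 13:50:17 200.40.203.197 GET /localidades/img/cargando.gif 200"
)

# Column names come from the "#Fields:" directive.
fields_line <- grep("^#Fields: ", x, value = TRUE)
field_names <- strsplit(sub("^#Fields: ", "", fields_line), " ", fixed = TRUE)[[1]]

# Keep only the request lines, as suggested above.
requests <- x[grep("GET", x)]

# Split each line on runs of whitespace: one character vector per line.
y <- strsplit(requests, "[[:space:]]+")

# Every line must yield the same number of fields before building a matrix.
K <- unique(sapply(y, length))
stopifnot(length(K) == 1)

# Fill row by row so that each log line becomes one row.
z <- matrix(unlist(y), ncol = K, byrow = TRUE)

# Optional last step: a data frame with the original field names.
log_df <- as.data.frame(z, stringsAsFactors = FALSE)
names(log_df) <- field_names
log_df[["sc-status"]] <- as.integer(log_df[["sc-status"]])
```

Everything in z is a character string, so numeric columns need an explicit conversion, as done for sc-status here. Filtering on "GET" also silently drops POST and other requests, so dropping lines that start with "#" may be the safer filter for a real log.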

Jay



-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
