thr3ads.net - R help - [R] splitting very long character string [Nov 2006]


Arne.Muller at sanofi-aventis.com

2006-Nov-01 15:47 UTC

[R] splitting very long character string

Hello,

I have a very long character string (>500k characters) that I need to split by
'\n', resulting in an array of about 60k numbers. The help on strsplit
says to use perl=TRUE to get better performance, but it still takes several minutes
to split this string.

The massive string is the return value of a call to xmlElementsByTagName from
the XML library and looks like this:

...
12345
564376
5674
6356656
5666
...

I've to read about a hundred of these files and was wondering whether
there's a more efficient way to turn this string into an array of numerics.
Any ideas?

	thanks a lot for your help
	and kind regards,

	Arne





john seers (IFR)

2006-Nov-01 16:00 UTC

Hi Arne

If you are reading in from files and they are just one number per line
it would be more efficient to use scan directly.  ?scan

For example:
> filen<-"C:/temp/tt.txt"
> i<-scan(filen)
Read 5 items
> i
[1]   12345  564376    5674 6356656    5666
>

 


______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Marc Schwartz

2006-Nov-01 16:05 UTC

> Vec <- sample(c(0:9, "\n"), 500000, replace = TRUE)
> str(Vec)
 chr [1:500000] "7" "0" "9" "6" "5" "3" "1" "9" ...
> table(Vec)
Vec
   \n     0     1     2     3     4     5     6     7     8     9
45432 45723 45641 45526 45460 45284 45378 45392 45374 45314 45476

> sink("Vec.txt")
> cat(Vec)
> sink()

First 10 lines of Vec.txt:

7 0 9 6 5 3 1 9 8 1 8 3 4 2 
 1 2 2 
 3 7 7 6 8 3 4 7 4 
 9 2 1 9 8 7 2 0 9 4 3 
 9 3 5 2 2 5 8 0 5 4 5 6 1 5 8 7 4 1 2 8 3 2 6 4 9 4 1 6 8 5 0 8 8 8 5 3 0 5 3 5 4 8 5 4 3
 9 
 5 3 6 5 8 9 7 6 9 
 5 8 
 2 4 6 
 5 

> system.time(Vec.Split <- scan("Vec.txt", sep = "\n"))
Read 41276 items
[1] 0.180 0.004 0.186 0.000 0.000
> str(Vec.Split)
 num [1:41276] 7.10e+13 1.22e+02 3.78e+08 9.22e+10 9.35e+44 ...
> sprintf("%.0f", Vec.Split[1:10])
 [1] "70965319818342"
 [2] "122"
 [3] "377683474"
 [4] "92198720943"
 [5] "935225805456158720742405574866620654670577664"
 [6] "9"
 [7] "536589769"
 [8] "58"
 [9] "246"
[10] "5"


Does that help?

Marc Schwartz

Prof Brian Ripley

2006-Nov-01 16:14 UTC

On Wed, 1 Nov 2006, Arne.Muller at sanofi-aventis.com wrote:
> Hello,
>
> I have a very long character string (>500k characters) that I need to split
> by '\n', resulting in an array of about 60k numbers. The help on strsplit
> says to use perl=TRUE to get better performance, but it still takes several
> minutes to split this string.

Can't you use fixed=TRUE since you do not have a regular expression?
Nevertheless, if you are going to be creating about 60k character strings, 
the overhead in creating the strings will be very considerable.

If you just want the numbers, using an anonymous file() connection to 
write out the string and then using scan() might well be a lot more 
efficient.
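
Both suggestions above can be sketched roughly as follows. This is only an illustration under assumptions: the short string x is a hypothetical stand-in for the real 500k-character XML value, and the names v1, v2 and con are made up for the example.

```r
# Hypothetical stand-in for the long XML text value: one number per line.
x <- paste(c(12345, 564376, 5674, 6356656, 5666), collapse = "\n")

# Option 1: split on a fixed string instead of a regular expression,
# then convert the resulting character vector to numeric.
v1 <- as.numeric(strsplit(x, "\n", fixed = TRUE)[[1]])

# Option 2: write the string to an anonymous file() connection and
# let scan() parse the numbers directly.
con <- file()                  # anonymous file, opened for reading and writing
writeLines(x, con)
v2 <- scan(con, quiet = TRUE)  # default 'what' is numeric
close(con)

# Both routes should yield the same numeric vector.
stopifnot(identical(v1, v2))
```

The second route avoids creating the intermediate character vector of about 60k strings, which is the overhead mentioned above.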

> The massive string is the return value of a call to xmlElementsByTagName
> from the XML library and looks like this:
               ^^^^^^^
'package' or your own C code accessing libxml?
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Arne.Muller at sanofi-aventis.com

2006-Nov-02 10:24 UTC

Hello,

Thanks a lot for your help on splitting the string to get a numeric vector.
I'm now writing the string to a tempfile and reading it in via scan; this is
fast enough for me:

library(XML)

...
# Collect the 'tofDataSample' elements and take the text of the first one.
tmp <- xmlElementsByTagName(root, 'tofDataSample', recursive = TRUE)
tmp <- xmlValue(tmp[[1]])
cat(paste('splitting', nchar(tmp), 'string ...\n'))

# Write the string to a temporary file, then read it back as numbers.
tmp.file <- tempfile()
sink(tmp.file)
cat(tmp)
sink()
tmp <- scan(tmp.file)
unlink(tmp.file)
cat(paste('splitting done,', length(tmp), 'elements\n'))
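
Since about a hundred files need the same treatment, the tempfile step could be wrapped in a small function. The sketch below is an assumption, not code from the thread: string_to_numeric is a hypothetical helper name, writeLines() is swapped in for the sink()/cat() pair, and the two short strings stand in for real xmlValue() results.

```r
# Hypothetical helper (not from the thread): convert a newline-separated
# string of numbers into a numeric vector via a temporary file.
string_to_numeric <- function(x) {
  tmp.file <- tempfile()
  on.exit(unlink(tmp.file))      # remove the file even if scan() fails
  writeLines(x, tmp.file)        # write the string directly, no sink() needed
  scan(tmp.file, quiet = TRUE)   # default 'what' is numeric
}

# Stand-ins for several xmlValue() results, one string per input file.
strings <- c("12345\n564376\n5674", "6356656\n5666")
values  <- lapply(strings, string_to_numeric)
str(values)
```

Using on.exit() keeps temporary files from piling up if one of the inputs turns out not to be purely numeric.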

	thanks again
	and kind regards,

	Arne