thr3ads.net - R devel - [Rd] slightly speeding up readChar() [Aug 2011]

If this information is useful, please help other people find it:
Share via:

Michael Lachmann

2011-Aug-04 21:26 UTC

[Rd] slightly speeding up readChar()

Hi,

I was trying to have R read files faster with readChar(). That was before I
noticed that readChar() is not that bad! In any case, below I suggest a few
simple changes that will make readChar slightly faster.

I followed readChar(useBytes=T), and tried to identify all O(N) operations,
where N is the size of the file. The assumption is that for LARGE files we want
to avoid any O(N) operations, and any O(N) memory allocations.

Here they are:

1. In readFixedString in envir.c, an N sized vector is 
allocated, and memset to 0. O(N)

2. The file is read into the buffer with con->read O(N) (but this probably
can't be dropped)

3. mkChar is called, which calls mkCharLenCE(name, strlen(name), CE_NATIVE);
   strlen is O(N)

4. In mkCharLenCE, a loop along the string looks for 0s to tell if the string
includes NULs (notice that because strlen was called before, that can't
really happen) O(N)

5. A hashcode is computed for the string to see if it is already in memory. That
is an O(N) operation.

6. A Charsxp of size N is allocated

7. The data is copied to the Charsxp - O(N).

So, as far as I could tell, in addition to the reading operation, 5 O(N)
operations are done, and double the memory of what is needed is allocated.

A couple of these operations are easy to drop:

1. One could only zero the memory beyond what was read, in case not N chars were
read.

2. We know the length of the string, so we can call mkCharLenCE directly from
readFixedString with the right length.

Others could maybe be dropped.

3. Does one really need to look for 0s?
In readFixedString there is a comment:
    /* String may contain nuls which we now (R >= 2.8.0) assume to be
       padding and ignore silently */
4. If a file was just read, is it likely that it is in the hash? Is it worth
paying the time for those people who read in the same file twice?

Finally about the allocation.

Could the Charsxp be allocated to begin with, and the data read straight into
it?
Then we'd save one extra allocation, and a memcpy. For that one would need
something like mkEmptyCharLen.

One could also allocate a slighly bigger memory region, and then pass that so
that instead of allocating it a new the old pointer is used (?).

In any case, here is an updated readFixedString(), which would drop 2 O(N)
operations.

---
static SEXP
	readFixedString(Rconnection con, int len, int useBytes)
{
	SEXP ans;
	char *buf;
	int  m;
	const void *vmax = vmaxget();

	if(utf8locale && !useBytes) {
		int i, clen;
		char *p, *q;

		p = buf = (char *) R_alloc(MB_CUR_MAX*len+1, sizeof(char));
		memset(buf, 0, MB_CUR_MAX*len+1);
		for(i = 0; i < len; i++) {
			q = p;
			m = con->read(p, sizeof(char), 1, con);
			if(!m) { if(i == 0) return R_NilValue; else break;}
			clen = utf8clen(*p++);
			if(clen > 1) {
				m = con->read(p, sizeof(char), clen - 1, con);
				if(m < clen - 1) error(_("invalid UTF-8 input in readChar()"));
				p += clen - 1;
		/* NB: this only checks validity of multi-byte characters */
				if((int)mbrtowc(NULL, q, clen, NULL) < 0)
					error(_("invalid UTF-8 input in readChar()"));
			}
		}
	} else {
		buf = (char *) R_alloc(len+1, sizeof(char));
   //memset() was here
		m = con->read(buf, sizeof(char), len, con);
		if(m < len )
			memset(buf, m+1, len+1); // changed
		if(len && !m) return R_NilValue;
	}
	/* String may contain nuls which we now (R >= 2.8.0) assume to be
	padding and ignore silently */
	ans = mkCharLenCE(buf, len, CE_NATIVE); // changed (one could also use no. read
bytes as size)
	vmaxset(vmax);
	return ans;
}
--

The other changes are also not that hard - I'd do them if people think such
changes should be included....


Thanks for listening,

Michael Lachmann

Simon Urbanek

2011-Aug-04 21:50 UTC

head link

[Rd] slightly speeding up readChar()

On Aug 4, 2011, at 5:26 PM, Michael Lachmann wrote:
> Hi,
> 
> I was trying to have R read files faster with readChar(). That was before I
noticed that readChar() is not that bad! In any case, below I suggest a few
simple changes that will make readChar slightly faster.
> 
> I followed readChar(useBytes=T), and tried to identify all O(N) operations,
where N is the size of the file. The assumption is that for LARGE files we want
to avoid any O(N) operations, and any O(N) memory allocations.
> 
I'm not sure it's really worth bothering with such optimizations, on my
machine I get
> system.time(readChar("large.file",1e8,T))   user  system elapsed 
  0.295   0.048   0.343 

so that is a fraction of a second for 100MB of data. Besides, why in the world
would you want to use character vectors to store bytes? It's much more
efficient to work with raw vectors instead...

AFAICS the only lesson from the list below is that we could add in an internal
function for safe_mkCharLenCE that doesn't check for NULs. But that seems a
little contrived for a very special case. Everything else is either increasing
complexity at the cost of safety or could lead to internal inconsistencies
(hashing is mandatory, otherwise you can't compare CHARSXPs).

BTW: your code doesn't do what you think it does - you'll force a pretty
ugly buffer overflow - perfect illustration why optimization should be done only
if really needed, otherwise you are just likely to introduce bugs...

Cheers,
Simon


> Here they are:
> 
> 1. In readFixedString in envir.c, an N sized vector is 
> allocated, and memset to 0. O(N)
> 
> 2. The file is read into the buffer with con->read O(N) (but this
probably can't be dropped)
> 
> 3. mkChar is called, which calls mkCharLenCE(name, strlen(name),
CE_NATIVE);
>   strlen is O(N)
> 
> 4. In mkCharLenCE, a loop along the string looks for 0s to tell if the
string includes NULs (notice that because strlen was called before, that
can't really happen) O(N)
> 
> 5. A hashcode is computed for the string to see if it is already in memory.
That is an O(N) operation.
> 
> 6. A Charsxp of size N is allocated
> 
> 7. The data is copied to the Charsxp - O(N).
> 
> So, as far as I could tell, in addition to the reading operation, 5 O(N)
operations are done, and double the memory of what is needed is allocated.
> 
> A couple of these operations are easy to drop:
> 
> 1. One could only zero the memory beyond what was read, in case not N chars
were read.
> 
> 2. We know the length of the string, so we can call mkCharLenCE directly
from readFixedString with the right length.
> 
> Others could maybe be dropped.
> 
> 3. Does one really need to look for 0s?
> In readFixedString there is a comment:
>    /* String may contain nuls which we now (R >= 2.8.0) assume to be
>       padding and ignore silently */
> 4. If a file was just read, is it likely that it is in the hash? Is it
worth paying the time for those people who read in the same file twice?
> 
> Finally about the allocation.
> 
> Could the Charsxp be allocated to begin with, and the data read straight
into it?
> Then we'd save one extra allocation, and a memcpy. For that one would
need something like mkEmptyCharLen.
> 
> One could also allocate a slighly bigger memory region, and then pass that
so that instead of allocating it a new the old pointer is used (?).
> 
> In any case, here is an updated readFixedString(), which would drop 2 O(N)
operations.
> 
> ---
> static SEXP
> 	readFixedString(Rconnection con, int len, int useBytes)
> {
> 	SEXP ans;
> 	char *buf;
> 	int  m;
> 	const void *vmax = vmaxget();
> 
> 	if(utf8locale && !useBytes) {
> 		int i, clen;
> 		char *p, *q;
> 
> 		p = buf = (char *) R_alloc(MB_CUR_MAX*len+1, sizeof(char));
> 		memset(buf, 0, MB_CUR_MAX*len+1);
> 		for(i = 0; i < len; i++) {
> 			q = p;
> 			m = con->read(p, sizeof(char), 1, con);
> 			if(!m) { if(i == 0) return R_NilValue; else break;}
> 			clen = utf8clen(*p++);
> 			if(clen > 1) {
> 				m = con->read(p, sizeof(char), clen - 1, con);
> 				if(m < clen - 1) error(_("invalid UTF-8 input in
readChar()"));
> 				p += clen - 1;
> 		/* NB: this only checks validity of multi-byte characters */
> 				if((int)mbrtowc(NULL, q, clen, NULL) < 0)
> 					error(_("invalid UTF-8 input in readChar()"));
> 			}
> 		}
> 	} else {
> 		buf = (char *) R_alloc(len+1, sizeof(char));
>   //memset() was here
> 		m = con->read(buf, sizeof(char), len, con);
> 		if(m < len )
> 			memset(buf, m+1, len+1); // changed
> 		if(len && !m) return R_NilValue;
> 	}
> 	/* String may contain nuls which we now (R >= 2.8.0) assume to be
> 	padding and ignore silently */
> 	ans = mkCharLenCE(buf, len, CE_NATIVE); // changed (one could also use no.
read bytes as size)
> 	vmaxset(vmax);
> 	return ans;
> }
> --
> 
> The other changes are also not that hard - I'd do them if people think
such changes should be included....
> 
> 
> Thanks for listening,
> 
> Michael Lachmann
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>

Reasonably Related Threads

Search for more maybe matching threads

R devel - Aug 2011 - slightly speeding up readChar()

[Rd] slightly speeding up readChar()

[Rd] slightly speeding up readChar()

Reasonably Related Threads