tfpt review /shelveset:MutableString5;REDMOND\tomat

A new implementation for Ruby MutableString and Ruby regular expression wrappers. This is just the first pass, without optimizations and without encodings (the default system encoding is used for all strings). Many improvements and adjustments will come in the future, and some hacks will be removed.

Basic architecture:

MutableString holds a Content and an Encoding. Content is an abstract class that has three subclasses:

1) StringContent
   - Holds an instance of System.String, an immutable .NET string. This is the default representation for strings coming from CLR methods and for Ruby string literals.
   - A textual write operation on a mutable string with this content representation implicitly converts the representation to StringBuilderContent.
   - A binary read/write operation triggers a transition to BinaryContent using the Encoding stored on the owning MutableString.

2) StringBuilderContent
   - Holds an instance of System.Text.StringBuilder, a mutable Unicode string.
   - A binary read/write operation transforms the content to the BinaryContent representation.
   - StringBuilder is not optimal for some operations (it requires unnecessary copying); we may consider replacing it with a resizable char[].

3) BinaryContent
   - A textual read/write operation transforms the content to the StringBuilderContent representation.
   - List<byte> is currently used, but it doesn't fit many operations very well. We should replace it with a resizable byte[].

The content representation changes based on the operations performed on the mutable string. There is currently no limit on the number of content type switches, so if one alternates binary and textual operations, a conversion takes place for each of them. Although this shouldn't be a common case, we may consider adding some counters and keeping the representation binary or textual based on their values.
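To make the representation switching described above concrete, here is a minimal sketch. It is not the code from the shelveset: the class names follow the description, but every member (ToText, ToBinary, AppendText, AppendByte, ToByteArray, Switches) is an assumption made for illustration, and only writes and a binary read are modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch: class names follow the description, members are assumptions.
public abstract class Content {
    public abstract Content ToText(Encoding encoding);   // required before a textual write
    public abstract Content ToBinary(Encoding encoding); // required before a binary read/write
}

public sealed class StringContent : Content {
    public readonly string Value;
    public StringContent(string value) { Value = value; }
    public override Content ToText(Encoding encoding) {
        return new StringBuilderContent(new StringBuilder(Value));
    }
    public override Content ToBinary(Encoding encoding) {
        return new BinaryContent(new List<byte>(encoding.GetBytes(Value)));
    }
}

public sealed class StringBuilderContent : Content {
    public readonly StringBuilder Value;
    public StringBuilderContent(StringBuilder value) { Value = value; }
    public override Content ToText(Encoding encoding) { return this; }
    public override Content ToBinary(Encoding encoding) {
        return new BinaryContent(new List<byte>(encoding.GetBytes(Value.ToString())));
    }
}

public sealed class BinaryContent : Content {
    public readonly List<byte> Value;
    public BinaryContent(List<byte> value) { Value = value; }
    public override Content ToText(Encoding encoding) {
        return new StringBuilderContent(new StringBuilder(encoding.GetString(Value.ToArray())));
    }
    public override Content ToBinary(Encoding encoding) { return this; }
}

public sealed class MutableStringSketch {
    private Content _content;
    private readonly Encoding _encoding;

    public MutableStringSketch(string value, Encoding encoding) {
        _content = new StringContent(value); // default representation for CLR strings
        _encoding = encoding;
    }

    // Counts representation switches, to show the cost of alternating operations.
    public int Switches { get; private set; }
    public string ContentKind { get { return _content.GetType().Name; } }

    private T SwitchTo<T>(Content next) where T : Content {
        if (!ReferenceEquals(next, _content)) { _content = next; Switches++; }
        return (T)next;
    }

    // Textual write: forces the StringBuilderContent representation.
    public MutableStringSketch AppendText(string text) {
        SwitchTo<StringBuilderContent>(_content.ToText(_encoding)).Value.Append(text);
        return this;
    }

    // Binary write: forces the BinaryContent representation.
    public MutableStringSketch AppendByte(byte value) {
        SwitchTo<BinaryContent>(_content.ToBinary(_encoding)).Value.Add(value);
        return this;
    }

    // Binary read: also forces the BinaryContent representation.
    public byte[] ToByteArray() {
        return SwitchTo<BinaryContent>(_content.ToBinary(_encoding)).Value.ToArray();
    }
}
```

Alternating AppendText and AppendByte on one instance increments Switches on every call, which is the worst case the counters mentioned above would be meant to detect.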

The design assumes that library methods perform operations of two kinds, textual and binary, and that data once treated as text is not usually treated as raw binary data later. Any text in the IronRuby runtime is represented as a sequence of 16-bit Unicode characters (the standard .NET representation). Any binary data treated as text is converted to this representation, regardless of the encoding used for its stored representation in the file. The encoding is remembered in the MutableString instance, so the original representation can always be recreated. Not all Unicode characters fit into 16 bits, so some exotic ones are represented by multiple characters (surrogates). If a string contains such a character, some operations (e.g. indexing) might no longer be precise: the n-th item in the char[] isn't necessarily the n-th Unicode character in the string (there might be surrogates before it). We believe this imprecision is not a real-world issue and is worth the performance gain and implementation simplicity.

Tomas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MutableString5.diff
Type: application/octet-stream
Size: 326215 bytes
Desc: MutableString5.diff
URL: <http://rubyforge.org/pipermail/ironruby-core/attachments/20080509/ba9a7f1c/attachment-0001.obj>
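
The surrogate caveat in the design notes above can be checked with a few lines of plain .NET. This is only an illustration: SurrogateDemo and its members are hypothetical names, and the sample string is chosen arbitrarily.

```csharp
using System;

public static class SurrogateDemo {
    // U+1D11E (MUSICAL SYMBOL G CLEF) does not fit into 16 bits,
    // so .NET stores it as a surrogate pair: two char values.
    public const string Sample = "a\U0001D11Eb";

    // Number of 16-bit items, which is what indexing a .NET string is based on.
    public static int CharCount(string s) {
        return s.Length;
    }

    // Number of Unicode code points, counting each surrogate pair once.
    public static int CodePointCount(string s) {
        int count = 0;
        for (int i = 0; i < s.Length; i++) {
            if (char.IsHighSurrogate(s, i) && i + 1 < s.Length && char.IsLowSurrogate(s, i + 1)) {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }
}
```

For Sample, the third Unicode character ('b') sits at char index 3 rather than 2, which is exactly the indexing imprecision the design accepts.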

This is a big old diff to search through. I couldn't work out a way of easily patching it onto my source at home due to the folder differences.

I really like this hybrid idea and it looks like it will work well. I have one question with regards to encodings and KCODE.

I appreciate that String is changing between Ruby 1.8 and 1.9. It appears that this MutableString implementation is leaning more toward the 1.9 implementation (i.e. holding on to an Encoding within the String itself). 1.8 does not hold the encoding and, as I understand it, the implicit encoding of the bytes held in a String is driven off KCODE. Is that correct? If so, there are a number of scenarios which I think could cause problems with MutableString holding on to its own Encoding, all of which stem from KCODE being changed at runtime. I'll try to describe a concrete example and you can tell me where I am going wrong...

Assume that KCODE is set to UTF8. If you create a String from an array of bytes in Ruby, the bytes are just stored as-is. You can do encoding-dependent operations and UTF8 is assumed.

If you now change KCODE to, say, EUC, then the bytes in the String are unchanged, but encoding-dependent operations may now produce different results on the same string since they interpret the bytes differently.

The worry I have with MutableString is this: suppose you create a string from bytes and then do an operation that requires it to be converted to a CLR string internally. What happens when you change KCODE? You can't simply change the Encoding value of the MutableString, since if you then access the bytes you will not get back the same bytes that were originally put in. I suppose that, on changing KCODE, you could go through all the strings in memory that have been converted from binary to CLR strings and convert them (i.e. back to bytes via the old encoding and then to CLR strings via the new encoding). What would be the optimal solution in this case?
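
The round-trip worry above can be reproduced outside IronRuby with the standard .NET Encoding class. In this sketch EncodingSwitchDemo and Recode are hypothetical names, and ISO-8859-1 stands in for EUC only because it is available without extra setup on every .NET runtime; the effect is the same in kind for any pair of different encodings.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingSwitchDemo {
    // Decodes bytes to a CLR string with one encoding, then encodes the
    // string back to bytes with a (possibly different) encoding.
    public static byte[] Recode(byte[] bytes, Encoding decodeWith, Encoding encodeWith) {
        string text = decodeWith.GetString(bytes);
        return encodeWith.GetBytes(text);
    }

    // True when decoding and re-encoding reproduces the original bytes.
    public static bool RoundTrips(byte[] bytes, Encoding decodeWith, Encoding encodeWith) {
        return Recode(bytes, decodeWith, encodeWith).SequenceEqual(bytes);
    }
}
```

With the bytes C3 A9 (the UTF-8 form of U+00E9), decoding and encoding with UTF-8 returns the same two bytes, while decoding with UTF-8 and encoding with ISO-8859-1 returns the single byte E9. Swapping the encoding after the text conversion therefore loses the original bytes, which is the case Pete describes.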

Again, I am not talking from a position of deep knowledge here so I may be missing something really obvious. But I thought it was worth asking the question.

Regards,
Pete

From: ironruby-core-bounces at rubyforge.org [mailto:ironruby-core-bounces at rubyforge.org] On Behalf Of Tomas Matousek
Sent: Friday, 09 May 2008 19:08
To: IronRuby External Code Reviewers
Cc: ironruby-core at rubyforge.org
Subject: [Ironruby-core] Code Review: MutableString5

tfpt review /shelveset:MutableString5;REDMOND\tomat
[...]

$KCODE is orthogonal to the encoding in MutableString. $KCODE seems to be just a value that is used by some library methods that perform binary operations on textual data. MutableString.Encoding is the encoding of the representation. If a MutableString instance is created from a .NET string, the encoding associated with it is used whenever the string is consumed by a binary data operation. We could represent all strings as byte[], but then you'd need to convert .NET strings to byte[] at construction time. MutableString allows you to be lazy and perhaps not perform the conversion at all if it is not needed.

Could you give some code sample that you think could be broken?

Tomas

From: ironruby-core-bounces at rubyforge.org [mailto:ironruby-core-bounces at rubyforge.org] On Behalf Of Peter Bacon Darwin
Sent: Saturday, May 10, 2008 2:27 AM
To: ironruby-core at rubyforge.org
Subject: Re: [Ironruby-core] Code Review: MutableString5

This is a big old diff to search through.
[...]

One thing that MutableString could do with is

    public static MutableString/*!*/ CreateBinary(byte[]/*!*/ bytes, int start, int length)

At the moment you have to do something like:

    MutableString str = MutableString.CreateBinary();
    str.Append(buffer, 0, received);

Pete

From: ironruby-core-bounces at rubyforge.org [mailto:ironruby-core-bounces at rubyforge.org] On Behalf Of Tomas Matousek
Sent: Saturday, 10 May 2008 22:42
To: ironruby-core at rubyforge.org
Subject: Re: [Ironruby-core] Code Review: MutableString5

$KCODE is orthogonal to the encoding in MutableString.
[...]

I thought about that, but given that there are about 15 overloads of Append, adding them for the constructors as well might be unnecessary code duplication.

You can do it on a single line too:

    MutableString str = MutableString.CreateBinary(received).Append(buffer, 0, received);

Append returns the MutableString instance, and you can also pass an estimated capacity to CreateBinary if you know it.

Let's use this for now, and if the pattern turns out to be very common, let's consider adding more overloads.

Tomas
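
The trade-off discussed here, one chained Append versus a dedicated factory overload, can be sketched as follows. BinaryStringSketch is a hypothetical stand-in for the binary side of MutableString, not the real class. Its method names mirror the ones used in this thread, but the bodies are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the binary side of MutableString.
public sealed class BinaryStringSketch {
    private readonly List<byte> _bytes;

    private BinaryStringSketch(int capacity) {
        _bytes = new List<byte>(capacity);
    }

    public static BinaryStringSketch CreateBinary() {
        return new BinaryStringSketch(0);
    }

    // The capacity is only a hint; the list still grows on demand.
    public static BinaryStringSketch CreateBinary(int capacity) {
        return new BinaryStringSketch(capacity);
    }

    // The convenience overload Pete asks for, written as a thin wrapper
    // so that the copying logic lives only in Append.
    public static BinaryStringSketch CreateBinary(byte[] bytes, int start, int length) {
        return CreateBinary(length).Append(bytes, start, length);
    }

    // Returns this instance so that calls can be chained.
    public BinaryStringSketch Append(byte[] bytes, int start, int length) {
        if (bytes == null) throw new ArgumentNullException("bytes");
        if (start < 0 || length < 0 || start > bytes.Length - length) {
            throw new ArgumentOutOfRangeException();
        }
        for (int i = 0; i < length; i++) {
            _bytes.Add(bytes[start + i]);
        }
        return this;
    }

    public int Length { get { return _bytes.Count; } }

    public byte[] ToByteArray() {
        return _bytes.ToArray();
    }
}
```

Because Append returns the instance, CreateBinary(received).Append(buffer, 0, received) is a single expression, and the extra overload adds convenience without adding logic. That supports deferring it until the pattern proves common.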

From: ironruby-core-bounces at rubyforge.org [mailto:ironruby-core-bounces at rubyforge.org] On Behalf Of Peter Bacon Darwin
Sent: Sunday, May 11, 2008 5:23 AM
To: ironruby-core at rubyforge.org
Subject: Re: [Ironruby-core] Code Review: MutableString5

One thing that MutableString could do with is
[...]