thr3ads.net - llvm dev - [LLVMdev] Correct use of StringRef and Twine [Jul 2011]

If this information is useful, please help other people find it:
Share via:

Chris Lattner

2011-Jul-23 00:29 UTC

[LLVMdev] Correct use of StringRef and Twine

On Jul 22, 2011, at 2:59 PM, David Blaikie wrote:
>> The dangerous part of this is that characters are integers, so
"foo" + 'x' is very likely to cause serious problems.
> 
> std::string already provides such overloads though, doesn't it? So the
> code isn't any safer from accidental "foo" + 'x'
expressions that
> don't include Twine/StringRef/std::string than it was before. But if
> the argument is that std::string's interface was poorly
> designed/unsafe & we can do better/safer, I'm OK with making the
ctor
> explicit as you've suggested.
Yes, exactly.  I'm just saying that I think the additional clarity of:
  "foo" + Twine('x') 

is worth the inconvenience.
>>  You should also probably add a ctor for signed/unsigned char as well
(which reuse the existing CharKind enum).
> 
> Hmm - would it be safe to cast those signed/unsigned chars to straight
> char? (is it guaranteed that the signed & unsigned values with the
> same representation map to the same glyph?)
Yes.  I consider 'signed vs unsigned char vs char' to be a blight on the
C type system.  Just casting to char internally would be fine.
> As a side note on Twine's design: Is there a particular reason it uses
> void*s rather than unions? 
I'm not sure what you're proposing specifically.
> and chars rather than enums?
char vs enum is because of visual studio compatibility and because enums often
are stored as 32-bit values instead of 8-bit values.
> (sorry if I'm asking lots of "why is this like this"
questions all
> over the code base - I just don't want to assume that it's
intentional
> and replicate a pattern elsewhere that I don't understand only to find
> it's unintentional "not fixed yet" sort of stuff. I suppose
at the
> very least it'll be a chance to add in some explanatory comments if I
> do find things that are by design but weren't clear to me)
No problem at all, happy to help answer the questions.  Forcing a reexamination
of past decisions is not a bad thing at all in this case :)
>> Right.  Something like this could work:
>> 
>> foo(const Twine &T) {
>> ...
>>  TwineTmpBuffer Tmp;
>>  StringSwitch(T.toString(Tmp)).....
>> 
>> Which doesn't seem too horrible, just needs a typedef of
smallvector to TwineTmpBuffer.
> 
> In a few choice places, maybe, but as the default way to pass string
> parameters I think that'd be a hard sell as a general practice.

Yes, from a general API design perspective, I hate having to force a choice
between "convenience to implementor of an API to just take a
StringRef" vs "convenience to client of API for it to take a
Twine".  It really stinks.

I was chatting with Howard Hinnant about this and he suggested replacing Twine
with a template metaprogramming expression-template system.  I haven't
thought through all the details, but perhaps this would allow us to get the best
of both worlds?
> Hmm, this is one bit I'm not sure about. As I tried to explain above,
> it seems problematic to have to choose your argument type on the basis
> of how you think callers might use your API. From the perspective of a
> caller, a Twine argument is at least as expressive as a StringRef
> (since all StringRefs can be Twined implicitly), but it takes that
> extra step to write the implementation.
> 
> Perhaps I'm aiming for some kind of purist/perfectionist argument that
> isn't necessary or practical, but I hope I've been clear in
explaining
> my uncertainty/issue here.

Yes, I'm deeply unhappy about this aspect of our string api's.

-Chris

David Blaikie

2011-Jul-24 07:09 UTC

head link

[LLVMdev] Correct use of StringRef and Twine

> Yes, exactly.  I'm just saying that I think the additional clarity of:
>  "foo" + Twine('x')
>
> is worth the inconvenience.
Ok, attached a modified version of my patch with an Twine(char),
Twine(unsigned char), and Twine(signed char). All three are explicit &
have included test cases.

Speaking of which - is there any LLVM dev policy or preference around
C-style casts versus C++-style casts? Twine is littered with the
former but I generally use the latter for safety/clarity of intend,
etc. (I'm not going to try to fix everything in Twine just now if
we're considering alternative implementations like template
metaprogramming anyway)
>> As a side note on Twine's design: Is there a particular reason it
uses
>> void*s rather than unions?
>
> I'm not sure what you're proposing specifically.
I've attached an version of my fix that includes the union use I was
referring to (sorry for being vague in the previous email). Though if
we're considering a more explicitly structured solution with
templates/subtypes/etc we could skip the whole mess of enums &
unions/casts.
>> and chars rather than enums?
>
> char vs enum is because of visual studio compatibility and because enums
often are stored as 32-bit values instead of 8-bit values.
Curious - seems VS >2k3 (
http://msdn.microsoft.com/en-us/library/2dzy4k6e(v=VS.80).aspx ) has
supported specifying enum backing type (so you can specify it as char:
"enum foo : char { ... }") if that was the feature you were wanting to
use but avoiding.
> Yes, from a general API design perspective, I hate having to force a choice
between "convenience to implementor of an API to just take a
StringRef" vs "convenience to client of API for it to take a
Twine".  It really stinks.
Ok - glad we're on the same page there, wasn't quite sure whether we
were talking past one another in previous emails.

I have a couple of ideas about how to lower the developer cost of
using Twine ubiquitously. I'll try to describe them in some detail.
> I was chatting with Howard Hinnant about this and he suggested replacing
Twine with a template metaprogramming expression-template system.  I haven't
thought through all the details, but perhaps this would allow us to get the best
of both worlds?
I think I'd prefer a template metaprogramming solution to Twine just
from a purist perspective compared to all the
casting/reinterpreting/enums (even with the union) - that sort of
stuff makes me a bit uncomfortable, but I realize it's well contained
enough that it's not too risky (& I know some of the other tricks LLVM
uses like low bit twiddling in pointers, etc, so this seems relatively
sane in the grand scheme of things). But I'm not sure it'll
fundamentally change how we might be able to solve this caller
convenience issue.

So here are my ideas:

1) the easy solution: create a StringRef subclass (or new type with a
similarly expressive API) that wraps a buffer that it may or may not
use. Give it a ctor that takes a Twine. The use case then goes from
this:

    SmallVectorImpl<char> v;
    StringRef s = theTwine.GetStringRef(v);

to this:

    TwineString s(theTwine);

(& the ctor looks like: TwineString(const Twine& t) { *this
t.GetStringRef(this->buffer); })
So any Twine sink now has to pay one line of code to get a usable
string out of a Twine. Cheap to implement, easy to fix existing use
cases (privatize GetStringRef(SmallVectorImpl<char>) and make
TwineString a friend - fix everything that fails to compile).

This keeps things simple & seems to be "good enough" to me, but we
could perhaps do better (at the very least, again, if we did do
better, we could go back & remove TwineString & again fix all the
places that fail to compile with whatever new hotness we invent)

Hmm - new thought: what would happen if my argument type was simply
TwineString? If I knew my function needed access to the values (like
the Host.cpp example) but didn't have any buffer of its own that it
needed to use, that seems OK. This new thought perhaps renders
suggestion (2) uninteresting. Functions that know they always need the
string contents take a TwineString. Functions that might be able to
skip realizing the string entirely - or have their own buffer (eg: a
member std::string) to populate - take a Twine & explicitly construct
a TwineString only in the path that needs the value. That seems
optimal to me.

1.1) We probably want a better way to get a Twine into a std::string -
while the change to Triple's ctor works because returning a
std::string from a function into a ctor can do fancy in-place
construction - in any other case like "SetFoo(const Twine& t) {
this->s = t; }" we'd want something better. Perhaps simply
t.AssignTo(s);

2) This one isn't quite as fleshed out, but see if the general outline
makes sense: Introduce an extra type that's Twine-esque, but is only
the top level (so it's implicitly constructible from a const Twine&),
rather than all the branches of a Twine. If a function wants to accept
a Twine that it needs to manipulate, it can mark its argument as
TwineString2 (for want of a better name, just using 2 to distinguish
from the (1) example). TwineString2 has a buffer in it, but it starts
uninitialized. Any attempt to retrieve/inspect values causes the
string to be initialized. [downside: all string operations on a
TwineString2 incur an "if not initialized" check]

Twines could be constructed from TwineString2s if the value needed to
be passed on to another function. (& still take advantage of any
initialized string buffer it might have, rather than having to
remanifest from its child nodes repeatedly)

One fancy enhancement to this is that TwineString2 could partially
manifest - eg: if you want ts[2] it could walk its children (dropping
its references to children & updating the buffer with the contents at
each step) until it found the 2nd character - then stop, with a
partially expanded tree/buffer. Though there's probably nothing
stopping us implementing such smarts in the (1) solution eventually
either.

Thanks,
- David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: twine_explicit_char.patch
Type: application/octet-stream
Size: 4365 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110724/852ca59f/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: twine_explicit_char_union.patch
Type: application/octet-stream
Size: 15243 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110724/852ca59f/attachment-0001.obj>

Chris Lattner

2011-Jul-24 20:58 UTC

head link

[LLVMdev] Correct use of StringRef and Twine

On Jul 24, 2011, at 12:09 AM, David Blaikie wrote:>> Yes, exactly.  I'm just saying that I think the additional clarity
of:
>>  "foo" + Twine('x')
>> 
>> is worth the inconvenience.
> 
> Ok, attached a modified version of my patch with an Twine(char),
> Twine(unsigned char), and Twine(signed char). All three are explicit &
> have included test cases.
Ok, great.
> Speaking of which - is there any LLVM dev policy or preference around
> C-style casts versus C++-style casts? Twine is littered with the
Nope, use the one that is the most clear in a given case.  Except in extreme
cases, the C++ style casts seem overly verbose and don't add a lot of value.
>>> As a side note on Twine's design: Is there a particular reason
it uses
>>> void*s rather than unions?
>> 
>> I'm not sure what you're proposing specifically.
> 
> I've attached an version of my fix that includes the union use I was
> referring to (sorry for being vague in the previous email). Though if
> we're considering a more explicitly structured solution with
> templates/subtypes/etc we could skip the whole mess of enums &
> unions/casts.
aha, that *is* much nicer!  Applied with a few tweaks (remove some extraneous
'explicit' keywords and wrapped to 80 cols).  Thanks!
>>> and chars rather than enums?
>> 
>> char vs enum is because of visual studio compatibility and because
enums often are stored as 32-bit values instead of 8-bit values.
> 
> Curious - seems VS >2k3 (
> http://msdn.microsoft.com/en-us/library/2dzy4k6e(v=VS.80).aspx ) has
> supported specifying enum backing type (so you can specify it as char:
> "enum foo : char { ... }") if that was the feature you were
wanting to
> use but avoiding.
Right, but that requires VS #ifdefs.  You can also use enum bitfields, but VS
has different promotion rules from them than other compilers.
>> Yes, from a general API design perspective, I hate having to force a
choice between "convenience to implementor of an API to just take a
StringRef" vs "convenience to client of API for it to take a
Twine".  It really stinks.
> 
> Ok - glad we're on the same page there, wasn't quite sure whether
we
> were talking past one another in previous emails.
> 
> I have a couple of ideas about how to lower the developer cost of
> using Twine ubiquitously. I'll try to describe them in some detail.
Ok!
>> I was chatting with Howard Hinnant about this and he suggested
replacing Twine with a template metaprogramming expression-template system.  I
haven't thought through all the details, but perhaps this would allow us to
get the best of both worlds?
> 
> I think I'd prefer a template metaprogramming solution to Twine just
> from a purist perspective compared to all the
> casting/reinterpreting/enums (even with the union) - that sort of
> stuff makes me a bit uncomfortable, but I realize it's well contained
> enough that it's not too risky (& I know some of the other tricks
LLVM
> uses like low bit twiddling in pointers, etc, so this seems relatively
> sane in the grand scheme of things).
Right.  As long as the damage is well contained and isolated in the class, I
don't care much about stuff like this.  One of the great features of C++ is
that it lets you do whatever insane thing you want and draw a nice interface
around it :)
> 1) the easy solution: create a StringRef subclass (or new type with a
> similarly expressive API) that wraps a buffer that it may or may not
> use. Give it a ctor that takes a Twine. The use case then goes from
> this:
> 
>    SmallVectorImpl<char> v;
>    StringRef s = theTwine.GetStringRef(v);
> 
> to this:
> 
>    TwineString s(theTwine);
> 
> (& the ctor looks like: TwineString(const Twine& t) { *this >
t.GetStringRef(this->buffer); })
> So any Twine sink now has to pay one line of code to get a usable
> string out of a Twine. Cheap to implement, easy to fix existing use
> cases (privatize GetStringRef(SmallVectorImpl<char>) and make
> TwineString a friend - fix everything that fails to compile).
> 
> This keeps things simple & seems to be "good enough" to me,
but we
> could perhaps do better (at the very least, again, if we did do
> better, we could go back & remove TwineString & again fix all the
> places that fail to compile with whatever new hotness we invent)
This definitely seems like a step forward.  While it is kinda gross, a subclass
of StringRef is probably the lowest friction path to do this.
> Hmm - new thought: what would happen if my argument type was simply
> TwineString? If I knew my function needed access to the values (like
> the Host.cpp example) but didn't have any buffer of its own that it
> needed to use, that seems OK. This new thought perhaps renders
> suggestion (2) uninteresting. Functions that know they always need the
> string contents take a TwineString. Functions that might be able to
> skip realizing the string entirely - or have their own buffer (eg: a
> member std::string) to populate - take a Twine & explicitly construct
> a TwineString only in the path that needs the value. That seems
> optimal to me.
I'm dubious of this.  This is exposing an implementation detail to clients
that they shouldn't have to know, and if passed by value, this could be
expensive.
> 1.1) We probably want a better way to get a Twine into a std::string -
> while the change to Triple's ctor works because returning a
> std::string from a function into a ctor can do fancy in-place
> construction - in any other case like "SetFoo(const Twine& t) {
> this->s = t; }" we'd want something better. Perhaps simply
> t.AssignTo(s);
Adding a Twine::assignTo method would make sense to me!
> 2) This one isn't quite as fleshed out, but see if the general outline
> makes sense: Introduce an extra type that's Twine-esque, but is only
> the top level (so it's implicitly constructible from a const
Twine&),
> rather than all the branches of a Twine. If a function wants to accept
> a Twine that it needs to manipulate, it can mark its argument as
> TwineString2 (for want of a better name, just using 2 to distinguish
> from the (1) example). TwineString2 has a buffer in it, but it starts
> uninitialized. Any attempt to retrieve/inspect values causes the
> string to be initialized. [downside: all string operations on a
> TwineString2 incur an "if not initialized" check]
> 
> Twines could be constructed from TwineString2s if the value needed to
> be passed on to another function. (& still take advantage of any
> initialized string buffer it might have, rather than having to
> remanifest from its child nodes repeatedly)
> 
> One fancy enhancement to this is that TwineString2 could partially
> manifest - eg: if you want ts[2] it could walk its children (dropping
> its references to children & updating the buffer with the contents at
> each step) until it found the 2nd character - then stop, with a
> partially expanded tree/buffer. Though there's probably nothing
> stopping us implementing such smarts in the (1) solution eventually
> either.
I think you're flying way over my head here :)

-Chris

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - Jul 2011 - [LLVMdev] Correct use of StringRef and Twine

[LLVMdev] Correct use of StringRef and Twine

[LLVMdev] Correct use of StringRef and Twine

[LLVMdev] Correct use of StringRef and Twine

Possibly Parallel Threads