Hey all,

So, I'm attempting to decode some (and I don't know why anyone did this) URL-encoded user agents. Running URLdecode over them generates the error:

"Error in rawToChar(out) : embedded nul in string"

Okay, so there's an embedded nul - fair enough. Presumably decoding the URL is exposing it in a format R doesn't like. Except when I try to dig down and work out what an encoded nul looks like, in order to simply remove them with something like gsub(), I end up with several different strings, all of which apparently resolve to an embedded nul:

> URLdecode("0;%20@%gIL")
Error in rawToChar(out) : embedded nul in string: '0; @\0L'
In addition: Warning message:
In URLdecode("0;%20@%gIL") :
  out-of-range values treated as 0 in coercion to raw

> URLdecode("%20%use")
Error in rawToChar(out) : embedded nul in string: ' \0e'
In addition: Warning message:
In URLdecode("%20%use") :
  out-of-range values treated as 0 in coercion to raw

I'm a relative newb to encodings, so maybe the fault is simply in my understanding of how this should work, but - why are both strings being read as including nuls, despite having different values? And how would I go about removing said nuls?

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
I would guess that the original URLs were encoded somehow (non-ASCII), and the person who received them didn't understand how to deal with them either and URL-encoded them with the thought that they would not lose information that way. Unfortunately, they probably lost the meta information as to how they were originally encoded, and without that this turns into a detective job that will likely need C's ability (perhaps via Rcpp) to ignore type information to put things back.

If you are lucky all strings were originally encoded the same way... if really lucky they were all UTF-8 or UTF-16 (which would have nuls and other odd bytes). Proceeding with the broken strings you have now will almost certainly not work. The fragments shown are not even vaguely recognizable as URLs, so I don't see how we can do anything meaningful with them.

Please read the Posting Guide. One point made there to note is that if C becomes part of the question then R-devel becomes the more appropriate list. The other is that for all of these lists plain text email is expected (not HTML).

---------------------------------------------------------------------------
Jeff Newmiller
DCN:<jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On September 1, 2014 9:02:33 AM PDT, Oliver Keyes <okeyes at wikimedia.org> wrote:
> [...]
Hi Oliver,

I think you're being misled by the default behaviour of warnings: they all get displayed at once, before control returns to the console. If you make them immediate (for example with options(warn = 1)), you get a slightly more informative error:

> URLdecode("0;%20@%gIL")
Warning in URLdecode("0;%20@%gIL") :
  out-of-range values treated as 0 in coercion to raw
Error in rawToChar(out) : embedded nul in string: '0; @\0L'

So the out-of-range value (%g...) is getting converted to as.raw(0), aka a nul. Then rawToChar() chokes. That is also why both of your strings end up with a nul despite having different values: each contains a "%" that is not followed by two valid hex digits.

The code for URLdecode is simple enough that I'd recommend rewriting it yourself to better handle bad inputs.

Hadley

On Mon, Sep 1, 2014 at 11:02 AM, Oliver Keyes <okeyes at wikimedia.org> wrote:
> [...]

--
http://had.co.nz/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
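
As a follow-up to the suggestion of rewriting URLdecode to cope with bad inputs, here is a minimal sketch of what that might look like. The names url_decode_lenient and hex_value are hypothetical helpers, not part of base R, and two choices here are assumptions rather than anything settled in this thread: a "%" that is not followed by two hex digits is kept as literal text, and a well-formed "%00" is silently dropped so that rawToChar() never sees a nul.

```r
# Hypothetical helper: numeric value of one hex-digit byte, or NA if the
# byte is not one of 0-9, A-F, a-f.
hex_value <- function(b) {
  v <- as.integer(b)
  if (v >= 48L && v <= 57L) {
    v - 48L            # '0'-'9'
  } else if (v >= 65L && v <= 70L) {
    v - 55L            # 'A'-'F'
  } else if (v >= 97L && v <= 102L) {
    v - 87L            # 'a'-'f'
  } else {
    NA_integer_
  }
}

# Hypothetical lenient decoder (a sketch, not a drop-in for utils::URLdecode).
# Assumption: malformed escapes such as "%gI" are left untouched.
# Assumption: a decoded nul ("%00") is dropped rather than raising an error.
url_decode_lenient <- function(url) {
  bytes <- charToRaw(url)
  n <- length(bytes)
  pct <- charToRaw("%")
  out <- raw(0L)
  i <- 1L
  while (i <= n) {
    # Only look ahead when two more bytes actually exist.
    hi <- if (i + 2L <= n) hex_value(bytes[i + 1L]) else NA_integer_
    lo <- if (i + 2L <= n) hex_value(bytes[i + 2L]) else NA_integer_
    if (bytes[i] == pct && !is.na(hi) && !is.na(lo)) {
      value <- 16L * hi + lo
      if (value != 0L) out <- c(out, as.raw(value))  # skip embedded nuls
      i <- i + 3L
    } else {
      out <- c(out, bytes[i])  # ordinary byte, or a stray "%"
      i <- i + 1L
    }
  }
  rawToChar(out)
}

url_decode_lenient("0;%20@%gIL")
url_decode_lenient("%20%use")
```

With this approach the two strings from the original post come back with "%20" turned into a space and the invalid "%gI" and "%us" sequences left in place, which at least makes the bad inputs visible instead of turning them into nuls. Whether keeping them verbatim is the right policy depends on how the data were produced in the first place.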