Peter Bishop
2020-Aug-25 19:16 UTC
[R] Matching backslash in a table's column using R language
The feed is coming from a SQL table and this is using the embedded support for R which comes with SQL 2016. The source is therefore a SELECT statement. As an aside, I found a workaround by changing the pattern from: "[\x22\x27\x2c\x3f\x5c\x60]" to: "[\x22\x27\x2c\x3f\x5c\x5c\x60]" This seems to be escaping the backslash in the R script rather than in the data - which confuses me. ________________________________ From: Bert Gunter <bgunter.4567 at gmail.com> Sent: Wednesday, 26 August 2020 4:26 AM To: Peter Bishop <bishop_peterj at hotmail.com> Cc: r-help at r-project.org <r-help at r-project.org> Subject: Re: [R] Matching backslash in a table's column using R language 1. I am far from an expert on such matters 2. It is unclear to me what your input is -- I assume a file. The problem, as you indicate, is that R's parser sees "\B" as an incorrect escape character, so, for example:> cat("\B")Error: '\B' is an unrecognized escape in character string starting ""\B" In any case, I think you should look at ?scan. Here is an example where I scan from the keyboard first and then remove the "\". You may have to scan from a file to do this.> z <-scan(file = "", what = "character")1: A\BCDEFG 2: #CR terminates input Read 1 item> cat(z)A\BCDEFG> nchar(z)[1] 8 ## scan read in the "\" as a single character from the console.> sub("\\\\","",z) ## Yes, 4 backslashes[1] "ABCDEFG" There may be better ways to do this, but as I said, I'm no expert. BTW, in posting here, please post in *plain text,* as the server can mangle html. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Tue, Aug 25, 2020 at 9:02 AM Peter Bishop <bishop_peterj at hotmail.com<mailto:bishop_peterj at hotmail.com>> wrote: In SQL, I'm using R as a way to filter data based on: - 20 characters in the range <space> to <tilde> - excluding <quote>, <apostrophe>, <comma>, <question mark>, <backslash>, <backtick> Given a SQL column containing the data: code ---- A\BCDEFG and the T-SQL script: EXEC [sys].[sp_execute_external_script] @language=N'R', @script=N' pattern1 = "^[\x20-\x7e]{1,20}$" pattern2 = "[\x22\x27\x2c\x3f\x5c\x60]" outData <- subset(inData, grepl(pattern1, code, perl=TRUE) & !grepl(pattern2, code, perl=TRUE))', @input_data_1 = N'SELECT [code] FROM [dbo].[products]', @input_data_1_name = N'inData', @output_data_1_name = N'outData' WITH RESULT SETS (AS OBJECT [dbo].[products]); GO why does the row detailed above get returned? I know that backslash is a special character but not in the SQL table. Consequently, the T-SQL code: SELECT ASCII(SUBSTRING([value], 2, 1)) FROM [table] returns 92 (the ASCII code for <backslash>) which shows that this is being recognised as a backslash character and not as an escape indicator for the following "B". Can anyone advise how I can filter out the <backslash> in the way that the other identified characters are being successfully filtered? As the data is being retrieved from a table, I can?t ask the data provider to use ?\\? instead of ?\? as that will be invalid for other uses. Thanks. [[alternative HTML version deleted]] ______________________________________________ R-help at r-project.org<mailto:R-help at r-project.org> mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help<https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=02%7C01%7C%7C21350c4700bf48cb035008d849245d6c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339767813946809&sdata=v79GeIK6IXhKuF4q0qJ3uopMdJeYzKKvdi5a4NqgdeM%3D&reserved=0> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html<https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html&data=02%7C01%7C%7C21350c4700bf48cb035008d849245d6c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339767813956802&sdata=iKgAGe23wPBrIK4iI3q0Vqk19q%2B%2FgSgiRUV858XOU3A%3D&reserved=0> and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]]
Jeff Newmiller
2020-Aug-25 19:53 UTC
[R] Matching backslash in a table's column using R language
In my opinion, using hexadecimal ASCII is much more obscure than simply using the escape character properly... that is, you are doing no-one any favors by using them. But to attain clarity here, you need to envision what the various software layers are doing. In your case, SQLServer may not utilize escape character, but it is passing your R code to the R interpreter, which does use the escape character to convert source code into strings in memory, which are then passed into the regex parser, which is the final layer that also handles the same escape character. What may be confusing you is the distinction between what is in memory that the regex parser sees: ["',?\\`] and what the R string literal looks like that you should type to get this string into memory: "[\"',?\\\\`]" When you pass the latter literal to the cat() function, it will show you the former version. When you have the literal stored in memory, you can use the print() function to see what you have to type as a literal string to get the in-memory version. I use this trick (cat) to help me zero in on what is actually getting passed to the regex engine when I have difficulty envisioning what is going on. The regex engine needs that doubled backslash to recognize that _it_ should not give special treatment to the \ there, and should look for it in the input data. On August 25, 2020 12:16:35 PM PDT, Peter Bishop <bishop_peterj at hotmail.com> wrote:>The feed is coming from a SQL table and this is using the embedded >support for R which comes with SQL 2016. The source is therefore a >SELECT statement. > > >As an aside, I found a workaround by changing the pattern from: > > >"[\x22\x27\x2c\x3f\x5c\x60]" > > >to: > > >"[\x22\x27\x2c\x3f\x5c\x5c\x60]" > > >This seems to be escaping the backslash in the R script rather than in >the data - which confuses me. > >________________________________ >From: Bert Gunter <bgunter.4567 at gmail.com> >Sent: Wednesday, 26 August 2020 4:26 AM >To: Peter Bishop <bishop_peterj at hotmail.com> >Cc: r-help at r-project.org <r-help at r-project.org> >Subject: Re: [R] Matching backslash in a table's column using R >language > >1. I am far from an expert on such matters >2. It is unclear to me what your input is -- I assume a file. > >The problem, as you indicate, is that R's parser sees "\B" as an >incorrect escape character, so, for example: >> cat("\B") >Error: '\B' is an unrecognized escape in character string starting >""\B" > >In any case, I think you should look at ?scan. Here is an example where >I scan from the keyboard first and then remove the "\". You may have to >scan from a file to do this. > >> z <-scan(file = "", what = "character") >1: A\BCDEFG >2: #CR terminates input >Read 1 item > >> cat(z) >A\BCDEFG > >> nchar(z) >[1] 8 ## scan read in the "\" as a single character from the console. > >> sub("\\\\","",z) ## Yes, 4 backslashes >[1] "ABCDEFG" > >There may be better ways to do this, but as I said, I'm no expert. > >BTW, in posting here, please post in *plain text,* as the server can >mangle html. > >Cheers, >Bert > > >Bert Gunter > >"The trouble with having an open mind is that people keep coming along >and sticking things into it." >-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > >On Tue, Aug 25, 2020 at 9:02 AM Peter Bishop ><bishop_peterj at hotmail.com<mailto:bishop_peterj at hotmail.com>> wrote: >In SQL, I'm using R as a way to filter data based on: > - 20 characters in the range <space> to <tilde> >- excluding <quote>, <apostrophe>, <comma>, <question mark>, ><backslash>, <backtick> > >Given a SQL column containing the data: > > code > ---- > A\BCDEFG > >and the T-SQL script: > > EXEC [sys].[sp_execute_external_script] > @language=N'R', > @script=N' > pattern1 = "^[\x20-\x7e]{1,20}$" > pattern2 = "[\x22\x27\x2c\x3f\x5c\x60]" > >outData <- subset(inData, grepl(pattern1, code, perl=TRUE) & >!grepl(pattern2, code, perl=TRUE))', > @input_data_1 = N'SELECT [code] FROM [dbo].[products]', > @input_data_1_name = N'inData', > @output_data_1_name = N'outData' > WITH > RESULT SETS (AS OBJECT [dbo].[products]); > GO > >why does the row detailed above get returned? I know that backslash is >a special character but not in the SQL table. Consequently, the T-SQL >code: > > SELECT ASCII(SUBSTRING([value], 2, 1)) FROM [table] > >returns 92 (the ASCII code for <backslash>) which shows that this is >being recognised as a backslash character and not as an escape >indicator for the following "B". > >Can anyone advise how I can filter out the <backslash> in the way that >the other identified characters are being successfully filtered? As the >data is being retrieved from a table, I can?t ask the data provider to >use ?\\? instead of ?\? as that will be invalid for other uses. > >Thanks. > > [[alternative HTML version deleted]] > >______________________________________________ >R-help at r-project.org<mailto:R-help at r-project.org> mailing list -- To >UNSUBSCRIBE and more, see >https://stat.ethz.ch/mailman/listinfo/r-help<https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=02%7C01%7C%7C21350c4700bf48cb035008d849245d6c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339767813946809&sdata=v79GeIK6IXhKuF4q0qJ3uopMdJeYzKKvdi5a4NqgdeM%3D&reserved=0> >PLEASE do read the posting guide >http://www.R-project.org/posting-guide.html<https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html&data=02%7C01%7C%7C21350c4700bf48cb035008d849245d6c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339767813956802&sdata=iKgAGe23wPBrIK4iI3q0Vqk19q%2B%2FgSgiRUV858XOU3A%3D&reserved=0> >and provide commented, minimal, self-contained, reproducible code. > > [[alternative HTML version deleted]]-- Sent from my phone. Please excuse my brevity.
Peter Bishop
2020-Aug-25 20:19 UTC
[R] Matching backslash in a table's column using R language
To be honest, I've only used the hex values as that was the format in which the patterns were passed to me. However from your explanation, I now understand what's going on. I didn't appreciate that the characters were passed to another layer and not seeing the hex code as the raw backslash. Many thanks for the explanation.> On 25 Aug 2020, at 20:53, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote: > > ?In my opinion, using hexadecimal ASCII is much more obscure than simply using the escape character properly... that is, you are doing no-one any favors by using them. But to attain clarity here, you need to envision what the various software layers are doing. > > In your case, SQLServer may not utilize escape character, but it is passing your R code to the R interpreter, which does use the escape character to convert source code into strings in memory, which are then passed into the regex parser, which is the final layer that also handles the same escape character. > > What may be confusing you is the distinction between what is in memory that the regex parser sees: > > ["',?\\`] > > and what the R string literal looks like that you should type to get this string into memory: > > "[\"',?\\\\`]" > > When you pass the latter literal to the cat() function, it will show you the former version. When you have the literal stored in memory, you can use the print() function to see what you have to type as a literal string to get the in-memory version. I use this trick (cat) to help me zero in on what is actually getting passed to the regex engine when I have difficulty envisioning what is going on. > > The regex engine needs that doubled backslash to recognize that _it_ should not give special treatment to the \ there, and should look for it in the input data. > >> On August 25, 2020 12:16:35 PM PDT, Peter Bishop <bishop_peterj at hotmail.com> wrote: >> The feed is coming from a SQL table and this is using the embedded >> support for R which comes with SQL 2016. The source is therefore a >> SELECT statement. >> >> >> As an aside, I found a workaround by changing the pattern from: >> >> >> "[\x22\x27\x2c\x3f\x5c\x60]" >> >> >> to: >> >> >> "[\x22\x27\x2c\x3f\x5c\x5c\x60]" >> >> >> This seems to be escaping the backslash in the R script rather than in >> the data - which confuses me. >> >> ________________________________ >> From: Bert Gunter <bgunter.4567 at gmail.com> >> Sent: Wednesday, 26 August 2020 4:26 AM >> To: Peter Bishop <bishop_peterj at hotmail.com> >> Cc: r-help at r-project.org <r-help at r-project.org> >> Subject: Re: [R] Matching backslash in a table's column using R >> language >> >> 1. I am far from an expert on such matters >> 2. It is unclear to me what your input is -- I assume a file. >> >> The problem, as you indicate, is that R's parser sees "\B" as an >> incorrect escape character, so, for example: >>> cat("\B") >> Error: '\B' is an unrecognized escape in character string starting >> ""\B" >> >> In any case, I think you should look at ?scan. Here is an example where >> I scan from the keyboard first and then remove the "\". You may have to >> scan from a file to do this. >> >>> z <-scan(file = "", what = "character") >> 1: A\BCDEFG >> 2: #CR terminates input >> Read 1 item >> >>> cat(z) >> A\BCDEFG >> >>> nchar(z) >> [1] 8 ## scan read in the "\" as a single character from the console. >> >>> sub("\\\\","",z) ## Yes, 4 backslashes >> [1] "ABCDEFG" >> >> There may be better ways to do this, but as I said, I'm no expert. >> >> BTW, in posting here, please post in *plain text,* as the server can >> mangle html. >> >> Cheers, >> Bert >> >> >> Bert Gunter >> >> "The trouble with having an open mind is that people keep coming along >> and sticking things into it." >> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) >> >> >> On Tue, Aug 25, 2020 at 9:02 AM Peter Bishop >> <bishop_peterj at hotmail.com<mailto:bishop_peterj at hotmail.com>> wrote: >> In SQL, I'm using R as a way to filter data based on: >> - 20 characters in the range <space> to <tilde> >> - excluding <quote>, <apostrophe>, <comma>, <question mark>, >> <backslash>, <backtick> >> >> Given a SQL column containing the data: >> >> code >> ---- >> A\BCDEFG >> >> and the T-SQL script: >> >> EXEC [sys].[sp_execute_external_script] >> @language=N'R', >> @script=N' >> pattern1 = "^[\x20-\x7e]{1,20}$" >> pattern2 = "[\x22\x27\x2c\x3f\x5c\x60]" >> >> outData <- subset(inData, grepl(pattern1, code, perl=TRUE) & >> !grepl(pattern2, code, perl=TRUE))', >> @input_data_1 = N'SELECT [code] FROM [dbo].[products]', >> @input_data_1_name = N'inData', >> @output_data_1_name = N'outData' >> WITH >> RESULT SETS (AS OBJECT [dbo].[products]); >> GO >> >> why does the row detailed above get returned? I know that backslash is >> a special character but not in the SQL table. Consequently, the T-SQL >> code: >> >> SELECT ASCII(SUBSTRING([value], 2, 1)) FROM [table] >> >> returns 92 (the ASCII code for <backslash>) which shows that this is >> being recognised as a backslash character and not as an escape >> indicator for the following "B". >> >> Can anyone advise how I can filter out the <backslash> in the way that >> the other identified characters are being successfully filtered? As the >> data is being retrieved from a table, I can?t ask the data provider to >> use ?\\? instead of ?\? as that will be invalid for other uses. >> >> Thanks. >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at r-project.org<mailto:R-help at r-project.org> mailing list -- To >> UNSUBSCRIBE and more, see >> https://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=02%7C01%7C%7C61b6372acf8c4d215faa08d8493093e1%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339820268206663&sdata=9t4wAFzYNfo%2B%2BITjORuSPUVNDtcSHm5hJN8yYGnEUnU%3D&reserved=0<https://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=02%7C01%7C%7C61b6372acf8c4d215faa08d8493093e1%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339820268206663&sdata=9t4wAFzYNfo%2B%2BITjORuSPUVNDtcSHm5hJN8yYGnEUnU%3D&reserved=0> >> PLEASE do read the posting guide >> https://eur04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html&data=02%7C01%7C%7C61b6372acf8c4d215faa08d8493093e1%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339820268206663&sdata=UVGq0iXkHMdLFC2oL1l5WG730HW1fvEJPDFaiHS3a98%3D&reserved=0<https://eur04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html&data=02%7C01%7C%7C61b6372acf8c4d215faa08d8493093e1%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339820268206663&sdata=UVGq0iXkHMdLFC2oL1l5WG730HW1fvEJPDFaiHS3a98%3D&reserved=0> >> and provide commented, minimal, self-contained, reproducible code. >> >> [[alternative HTML version deleted]] > > -- > Sent from my phone. Please excuse my brevity. > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=02%7C01%7C%7C61b6372acf8c4d215faa08d8493093e1%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339820268206663&sdata=9t4wAFzYNfo%2B%2BITjORuSPUVNDtcSHm5hJN8yYGnEUnU%3D&reserved=0 > PLEASE do read the posting guide https://eur04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html&data=02%7C01%7C%7C61b6372acf8c4d215faa08d8493093e1%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637339820268206663&sdata=UVGq0iXkHMdLFC2oL1l5WG730HW1fvEJPDFaiHS3a98%3D&reserved=0 > and provide commented, minimal, self-contained, reproducible code.