Matt Olson
2008-Mar-22 20:35 UTC
Postgres binary_to_string and string_to_binary type casts
I''m using Postgres 8.2.4 with the postgres-0.7.9.2008.01.28 gem in a Rails 2.0.2 app. I have a model with a binary column and for the most part it works just fine. However, I have two test cases that fail. The data I put into the model comes back out truncated to the first null byte. (The older postgres-0.7.1 driver gives a slightly different, but still corrupt result.) After much debugging, I''ve discovered the problem. It lies in ActiveRecord::ConnectionAdapters::PostgreSQLColumn. The heuristic used to determine whether a column''s value has been previously encoded by escape_bytea is not reliable. In binary_to_string, it checks whether it should unescape a value by looking for the tell-tale sequence \nnn. However, in the two real- world test cases below, these sequences appear in the original data. So unescape_bytea gets called on a block of data that was not previously escaped, thereby mucking it up pretty good. The two real-world test cases that triggered this bug''s discovery can be found here: http://rubycloud.com/pg/base.tsz http://rubycloud.com/pg/1288.1024.jpg In base.tsz, the sequence \690 appears at byte offset 64,688. In 1288.1024.jpg, the sequence \754 appears at byte offset 27,316. It seems that we need a more reliable way to mark a data block as previously escaped by escape_bytea. This is a somewhat tricky problem. We could save a marker with the data and then remove it when unescaping it, but this would mean that the persisted data would only be usable from within Rails, and that this marker could never be changed, and that any other code that works with the database would have to know about this trick. I''ve developed a patch that does this and it fixes my problem, but haven''t submitted it because it''s an ugly solution. Any other ideas? Thanks, Matt --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core-unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
Michael Koziarski
2008-Mar-23 23:25 UTC
Re: Postgres binary_to_string and string_to_binary type casts
> Any other ideas?I don''t quite see why we need to check, but I don''t really store blobs so I could be missing something obvious. When we''re storing strings into varchar columns we don''t need to decide whether or not to escape the values, we simply remove a few key values when sticking them into queries and the rest is left untouched? -- Cheers Koz --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core-unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
Matt Olson
2008-Mar-24 05:12 UTC
Re: Postgres binary_to_string and string_to_binary type casts
Postgres has specific requirements for binary string storage. All non- printable characters need to be escaped with a \nnn sequence where nnn is the octal representation of the character. See http://www.postgresql.org/docs/8.2/static/datatype-binary.html for details. So in the type cast code (binary_to_string and string_to_binary) it needs to check whether the value stored in the attributes hash has actually gone through this process, or is just some data that was set but hasn''t been persisted yet. Really I think this whole problem could be solved if we knew the modified state of each column. If the value has been modified from the persisted value, then we would only do the conversion one way (string to binary), otherwise we would only do it the other way (binary to string). Right now this is determined by checking for a \nnn sequence which is unreliable. Another (better?) approach would be to keep track of whether a particular attribute has gone through the type cast process, caching the casted value, and only calling the type cast code when necessary. This would be preferable anyway for expensive type casts such as this one, or complex geometry data types such as those supported by PostGIS. --Matt On Mar 23, 4:25 pm, "Michael Koziarski" <mich...@koziarski.com> wrote:> > Any other ideas? > > I don''t quite see why we need to check, but I don''t really store blobs > so I could be missing something obvious. > > When we''re storing strings into varchar columns we don''t need to > decide whether or not to escape the values, we simply remove a few key > values when sticking them into queries and the rest is left untouched? > > -- > Cheers > > Koz--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core-unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
On Mar 23, 4:25 pm, "Michael Koziarski" <mich...@koziarski.com> wrote:> I don''t quite see why we need to check, but I don''t really store blobs > so I could be missing something obvious.PostgreSQL returns all data as strings (unless returning the entire result set in binary format, which is not really practical in this case). A string is a sequence of characters, which may be different than a sequence of bytes. So, we need to do the conversion at some point, the only question is when and where to do that conversion. This is just like any other type conversion, binary conversion should not be a special case in AR.> When we''re storing strings into varchar columns we don''t need to > decide whether or not to escape the values, we simply remove a few key > values when sticking them into queries and the rest is left untouched?That''s not entirely true. If you try to put invalid byte sequences into a text field, you will get an error. If it happens that you are not using encoding, than a sequence of characters is (almost) equivalent to a sequence of bytes. But that is not true for all users. Regards, Jeff Davis --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core-unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
On Mar 23, 10:12 pm, Matt Olson <mo8...@gmail.com> wrote:> Really I think this whole problem could be solved if we knew the > modified state of each column. If the value has been modified from the > persisted value, then we would only do the conversion one way (string > to binary), otherwise we would only do it the other way (binary to > string). Right now this is determined by checking for a \nnn sequence > which is unreliable. > > Another (better?) approach would be to keep track of whether a > particular attribute has gone through the type cast process, caching > the casted value, and only calling the type cast code when necessary. > This would be preferable anyway for expensive type casts such as this > one, or complex geometry data types such as those supported by > PostGIS.I''m inclined to say that if we follow two these two rules in AR: 1. Any value from the database is converted to the ruby type that higher layers expect to see, *before* assigning it to the attribute. 2. Any value being written to the database is converted just before it''s sent to the underlying database driver. All the problems will go away. There''s nothing special about binary data, there are many types that need correct conversion to/from the database. However, I also like your idea, because it allows lazy conversions (which could be a win on large binary data). But the most important thing to me is correctness. Regards, Jeff Davis --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core-unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---