> Grant,
>
> Did you sort out the last matter you mentioned, regarding getting UTF-8
> data into MySQL?

I gave up quickly. The truth is I don't receive enough data that needs to
be in UTF-8 to spend too much time on it at this point. It would of course
be nice to have IC working well with UTF-8, but there are other items with
a higher priority for me right now. I was hoping to set a catalog variable,
I suppose. It sounds like End Point's work with UTF-8 will go a long way
toward improving it, and I thank you guys in advance.

- Grant

> IIRC, Interchange doesn't do much of anything with the incoming data
> (for a POST or whatever) as far as encoding is concerned; it simply
> assumes raw encoding on the filehandle between Interchange and the
> vlink/tlink script.
>
> I believe this can work, provided that:
> * the actual web pages themselves, and the forms therein, are properly
> encoded with UTF-8, marked as such, and thus the browser submits data in
> UTF-8;
> * the client encoding on your DBD::mysql connection is set to raw, or
> whatever MySQL's equivalent encoding name for this is (I cannot
> remember; I seem to recall that MySQL may treat the latin1 encoding as
> simple raw encoding, in which case it wouldn't make a difference -- I
> moved to Postgres when I started dealing with any real UTF-8 data).
>
> This is all just treating it as raw data, which isn't necessarily
> ideal. For one, if the data is coming in as raw byte strings (as
> outlined above), then regexes will give you funky behavior (for
> instance, the HTML entity encoding routines will appear to break your
> data). This is because in a raw string, each character represents an
> octet rather than an actual character, but Perl has no way of knowing
> that.
> So, what is in fact a valid high-bit sequence in UTF-8 (for
> representing any character outside the 7-bit ASCII range) will appear as
> a series of odd characters in the raw string if you were to simply
> print the raw string to a non-UTF8 terminal. In order for regexes to
> work reliably, the raw data needs to be re-encoded as a UTF8 scalar,
> which requires messing with the Perl Encode module.
>
> If you don't need to run regexes or HTML entity filters or whatever
> against your inbound data, then you could probably get by with raw
> encoding. Otherwise, this will probably bite you.
>
> Assuming the data gets safely into MySQL as well-formed UTF8 (or
> assuming the data already exists in MySQL), pulling the data out is
> another matter. You'll need to look at the docs for DBD::mysql to see
> what it offers for UTF8 support, or to see if reading the data in from a
> database handle with the client encoding set to UTF8 would do the
> trick. Basically, UTF8 data coming out of the database will break in
> things like the table editor because of the same regular expression
> problem already mentioned; byte sequences that correspond to a single
> logical character are treated as separate characters and therefore
> semantically mismatch with the intentions of the regular expressions for
> things like HTML entity escaping. DBD::Pg (for Postgres) provides a
> setting for telling the driver to properly elevate text scalars to UTF8,
> which can address this issue; I'm not familiar with DBD::mysql's
> offerings for this sort of thing. If you can get the data returned from
> MySQL to be automatically elevated to UTF8 before Interchange touches
> it, then you may pull it off.
>
> It's a complicated issue. Once you have one Perl scalar that is marked
> internally as UTF8, any scalar it combines with will be elevated
> on-demand to UTF8.
> So, in theory, having one UTF8 string coming from
> one column in one record of your database could cause the entire output
> buffer for a page to be elevated. But what about all your template
> pages and such, and their encodings? File encoding is a somewhat
> mysterious topic, since files aren't typically flagged as being in a
> particular encoding. You have to know what kinds of encoding you're
> using in every aspect of your application in order for this to work out
> in a controlled fashion.
>
> And of course getting everything elevated to UTF8 will impose some kind
> of performance penalty. Probably not anything worth worrying about, I
> would guess, but it's best to be prepared.
>
> As Jon said a little while ago, we're (that is, End Point) preparing a
> change set to improve UTF8 support, and we've been making good
> progress. Once it's ready and the IC core team has their say, it should
> help considerably. However, it will remain a complex issue that
> requires a lot of attention to detail and a significant headache, as it
> affects all layers of the software stack.
>
> One final note: if you're working with UTF-8, then you will inevitably
> end up feeling a deep sense of loathing for CP1252, because it pops up
> *everywhere*. If your MySQL data is supposedly latin1, then it's almost
> certainly really CP1252. :)
>
> Thanks.
> - Ethan

_______________________________________________
interchange-users mailing list
http://www.icdevgroup.org/mailman/listinfo/interchange-users
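
The byte-versus-character problem Ethan describes can be seen in a few lines. The sketch below is in Python rather than Perl, purely to illustrate the general mechanism; it is not Interchange code. The `entity_escape` function is a hypothetical stand-in for an HTML entity filter, not the routine Interchange actually uses. The last two lines show why data labelled latin1 that is really CP1252 decodes differently depending on which name you trust.

```python
import re

# "café" as a browser would submit it from a UTF-8 form: "é" is two octets.
raw = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'

# A "raw" string: every octet is treated as if it were one character.
as_octets = raw.decode("latin-1")          # 5 "characters"
# The same octets decoded as UTF-8: one logical character per code point.
as_text = raw.decode("utf-8")              # 4 characters


def entity_escape(s):
    """Hypothetical entity filter: turn each non-ASCII character
    into a numeric HTML entity."""
    return re.sub(r"[^\x00-\x7f]", lambda m: "&#%d;" % ord(m.group()), s)


print(len(as_octets), len(as_text))
print(entity_escape(as_octets))            # escapes each octet separately
print(entity_escape(as_text))              # escapes the single character

# Byte 0x93 is a curly quote in CP1252 but a control character in ISO-8859-1.
smart_quote = b"\x93"
print(repr(smart_quote.decode("cp1252")), repr(smart_quote.decode("latin-1")))
```

Run against the octet-per-character string, the filter emits two entities for what was one character, which is the "appears to break your data" effect described above. Decoding first gives the filter the single logical character it expects.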