[Info-ingres] More similar to
Martin Bowes
martin.bowes at ndph.ox.ac.uk
Fri Aug 31 11:28:14 UTC 2018
Baroque indeed and I would expect no less.
I suspect as the data is stored in a long varchar that it was done in UTF-8.
U+201C
"
e2 80 9c
LEFT DOUBLE QUOTATION MARK
U+201D
"
e2 80 9d
RIGHT DOUBLE QUOTATION MARK
So in my data I see non asci characters E2809C...E2809D. So a simple per character match is probably insufficient, I need to examine multiple characters to work out what the hell they intended to say and recommend replacing both those character sequences with an ordinary double quote (x22).
Yikes.
Marty
From: Adrian Williamson [mailto:adrian.williamson at rationalcommerce.com]
Sent: 31 August 2018 12:00
To: Martin Bowes; info-ingres at lists.planetingres.org
Subject: RE: [Info-ingres] More similar to
Hi Marty,
I think your pain has a name : Transliteration
Roy invented some 'baroque' SQL for a customer recently which did the job for a varchar(25) to correct a load of issues for soundex_dm computed values.
I'm not sure it would scale to 12,000 as the working part of the SQL Insert / select had hard coded ordinal positions for each character.
It's the closest I've seen SQL being made to look like some C-Language gobbledygook.
I'll let him explain.
Cheers
Adrian
From: info-ingres-bounces at lists.planetingres.org <info-ingres-bounces at lists.planetingres.org> On Behalf Of Martin Bowes
Sent: 31 August 2018 11:41
To: info-ingres at lists.planetingres.org
Subject: [Info-ingres] More similar to
Hi All,
I just knew this non printing character stuff was going to come back and bite me...
I can identify the rows with non printing characters excluding \t, \r and \n.
Now I need to find which characters are being objected to on each row. I may need to know their position as well.
The data in question is a long varchar. But (fortunately) the longest item is under 12000 characters, so we are under the 32000 char varchar conversion limit.
The easiest way I can think of doing this is to do it in a database procedure...
create procedure identify_bad_char
as declare
snapshot_id integer4 not null not default;
res_note long varchar not null not default;
msgid integer4 not null not default;
msg varchar(256) not null not default;
begin
msgid = 0;
FOR SELECT snapshot_id, res_note INTO :snapshot_id, :res_note
FROM basket_snapshot
WHERE res_note SIMILAR TO '%[^' + x'09' + x'0a' + x'0d' + '[:print:]]%'
DO
msgid = msgid + 1;
msg = 'res_note for snapshot_id = ' + varchar(:snapshot_id) + ' is dodgy.';
message :msgid :msg;
/* Insert manic while loop here and process the line character by character */
ENDFOR;
end;
Before I start coding up the manic while loop ... does anyone have a suggestion as to how I can do this more efficiently.
The data was probably added to a web form via a cut and paste from a word document, so we've picked up things like smart quotes. Hence we may ultimately be interested in replacing multi character sequences with simple ansi equivalents. I have a suspicion a clean() function is galloping in my direction.
Martin Bowes
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.planetingres.org/pipermail/info-ingres/attachments/20180831/d0c65371/attachment.html>
More information about the Info-ingres
mailing list