[Info-ingres] More similar to

Fri Aug 31 11:28:14 UTC 2018

Baroque indeed and I would expect no less.

I suspect as the data is stored in a long varchar that it was done in UTF-8.
U+201C

"

e2 80 9c

LEFT DOUBLE QUOTATION MARK

U+201D

"

e2 80 9d

RIGHT DOUBLE QUOTATION MARK

So in my data I see non asci characters E2809C...E2809D. So a simple per character match is probably insufficient, I need to examine multiple characters to work out  what the hell they intended to say and recommend replacing both those character sequences with an ordinary double quote (x22).

Yikes.

Marty

From: Adrian Williamson [mailto:adrian.williamson at rationalcommerce.com]
Sent: 31 August 2018 12:00
To: Martin Bowes; info-ingres at lists.planetingres.org
Subject: RE: [Info-ingres] More similar to

Hi Marty,

I think your pain has a name : Transliteration

Roy invented some 'baroque' SQL for a customer recently which did the job for a varchar(25) to correct a load of issues for soundex_dm computed values.

I'm not sure it would scale to 12,000 as the working part of the SQL Insert / select had hard coded ordinal positions for each character.

It's the closest I've seen SQL being made to look like some C-Language gobbledygook.

I'll let him explain.

Cheers

Adrian

From: info-ingres-bounces at lists.planetingres.org <info-ingres-bounces at lists.planetingres.org> On Behalf Of Martin Bowes
Sent: 31 August 2018 11:41
To: info-ingres at lists.planetingres.org
Subject: [Info-ingres] More similar to

Hi All,

I just knew this non printing character stuff was going to come back and bite me...

I can identify the rows with non printing characters excluding \t, \r and \n.

Now I need to find which characters are being objected to on each row. I may need to know their position as well.

The data in question is a long varchar. But (fortunately) the longest item is under 12000 characters, so we are under the 32000 char varchar conversion limit.

The easiest way I can think of doing this is to do it in a database procedure...

create procedure identify_bad_char
as declare
    snapshot_id integer4 not null not default;
    res_note    long varchar not null not default;
    msgid       integer4 not null not default;
    msg         varchar(256) not null not default;
begin
    msgid = 0;
    FOR SELECT snapshot_id, res_note INTO :snapshot_id, :res_note
        FROM basket_snapshot
        WHERE res_note SIMILAR TO '%[^' + x'09' + x'0a' + x'0d' + '[:print:]]%'
    DO
        msgid = msgid + 1;
        msg = 'res_note for snapshot_id = ' + varchar(:snapshot_id) + ' is dodgy.';
        message :msgid :msg;

        /* Insert manic while loop here and process the line character by character */
    ENDFOR;
end;

Before I start coding up the manic while loop ... does anyone have a suggestion as to how I can do this more efficiently.

The data was probably added to a web form via a cut and paste from a word document, so we've picked up things like smart quotes. Hence we may ultimately be interested in replacing multi character sequences with simple ansi equivalents. I have a suspicion a clean() function is galloping in my direction.

Martin Bowes
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.planetingres.org/pipermail/info-ingres/attachments/20180831/d0c65371/attachment.html>