[Info-ingres] Weird copy timeout error E_LC0030_WRITE_SEND_FAIL

Mon Sep 18 09:16:09 UTC 2017

Hi Alex,

This now appears to be related to crossing a network domain.

The target I've been using is in a different network domain, but if I switch to a target in the same domain as the source then all is fine. Note that this new target is running the same Ingres version as the old target. There may be some configuration differences between them, but I suspect they would be minor.

I've asked our network people for their opinions.
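
One way to check the network-domain theory independently of Ingres is to repeat the same traffic pattern (connect, stay idle, then send a large volume) over a plain TCP socket. The following is a minimal sketch in Python under stated assumptions: the function name idle_then_send and its parameters are made up for illustration, and it assumes some listener on the far side that accepts the connection and discards whatever it receives. It says nothing about GCA or the copy protocol itself.

```python
import socket
import time


def idle_then_send(host, port, idle_seconds, payload_bytes, chunk=65536):
    """Connect, stay silent for idle_seconds, then push payload_bytes.

    Returns the number of bytes handed to the socket. Raises OSError
    (for example ConnectionResetError) if the connection was dropped.
    """
    sent = 0
    block = b"x" * chunk
    with socket.create_connection((host, port)) as sock:
        # Hypothetical stand-in for the job's pause on the open connection.
        time.sleep(idle_seconds)
        while sent < payload_bytes:
            n = min(chunk, payload_bytes - sent)
            sock.sendall(block[:n])
            sent += n
    return sent
```

Running it with a long and a short idle time against a listener in each domain would show whether plain TCP misbehaves in the same way. A clean result would not clear the network entirely, since the probe does not reproduce what the Ingres communications server sends.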

Marty

From: Martin Bowes [mailto:martin.bowes at ndph.ox.ac.uk]
Sent: 14 September 2017 15:37
To: Alex Hanshaw; info-ingres at lists.planetingres.org
Subject: Re: [Info-ingres] Weird copy timeout error E_LC0030_WRITE_SEND_FAIL

Hi Alex,

I'm going to have to do some more testing to flesh this out in detail.

What I have at the moment is that the failure depends on some weird combination of the delay and the file size.

At the moment, with a 1000s pause, a 175 million row file fails the copy as indicated, while a 150 million row file does not.

With a 10 second pause the 175 million row file succeeds.

I'll play with like-for-like versions tomorrow, and when that's done I'll raise an issue. I have a simple means of generating random data and a simple ESQL program which does the deed.

Marty

From: Alex Hanshaw [mailto:Alex.Hanshaw at actian.com]
Sent: 14 September 2017 14:47
To: Martin Bowes; info-ingres at lists.planetingres.org
Subject: RE: [Info-ingres] Weird copy timeout error E_LC0030_WRITE_SEND_FAIL

Hi Marty

The code checks are looking for message types, and these are all going to be established by looking at structures derived from message pointers.
Is this only happening when these two patch levels are used? Does the problem go away if both sides are at the same patch level?
I'm wondering if something changed a related structure and has caused an incompatibility that is not being correctly handled.

Alex

From: info-ingres-bounces at lists.planetingres.org [mailto:info-ingres-bounces at lists.planetingres.org] On Behalf Of Martin Bowes
Sent: 14 September 2017 14:09
To: info-ingres at lists.planetingres.org
Subject: [Info-ingres] Weird copy timeout error E_LC0030_WRITE_SEND_FAIL

Hi All,

I have a job which runs on a host with Ingres version II 10.2.0 (a64.lnx/100) + 15162.

It uses a vnode to connect to a database on another host which runs Ingres version II 10.2.0 (a64.lnx/100) + 15151.

Having established the connection to the remote database, it does some initial work and then must pause activity on that connection while work is being performed on other hosts/databases. Once that work is complete, the connection runs the following:
drop table if exists targetable; /* Which works with no error */

create table targetable(
    a integer4 not null not default,
    b integer4 not null not default,
    c integer4 not null not default
) with nojournaling;
/* This too works with no error. Note that the columns are all just plain old integers.
   Nothing fancy: no blobs, no nvarchar. */

copy table targetable(a=c0tab, b=c0tab, c=c0nl) from 'a/raging/great/data/file';

And on that last step we have recently started getting an error:
E_LC0030_WRITE_SEND_FAIL       GCA protocol service (GCA_SEND) failure with message type GCA_CDATA.
Internal service status E_GCfe06 -- Write to peer process failed; it may have exited. - System communication error: Connection reset by peer..Exiting session because of communications failure.

In the errlog on the target installation we see:
biota             ::[39831        IIGCC, 13193     , 0000000000000002]: Thu Sep 14 13:15:24 2017 E_GC2820_CONN_FAIL_INFO    Connection to node '::ffff:10.131.0.3', port '59824' for user 'ingres' failed: reason follows.
biota             ::[39831        IIGCC, 13193     , 0000000000000002]: Thu Sep 14 13:15:24 2017 E_CLFE07_BS_READ_ERR   Read from peer process failed; it may have exited.
biota             ::[39831        IIGCC, 13193     , 0000000000000002]: System communication error: Connection reset by peer.
BIOTA             ::[39278             , 13088     ,  00007f4eacc6f180, scscopy.c:613         ]: Thu Sep 14 13:15:24 2017 E_SC022E_WRONG_BLOCK_TYPE Internal Protocol Error: SCF received block type 00000005 (5.) when expecting type 00000019 (25.).
BIOTA             ::[39278             , 13088     ,  00007f4eacc6f180, scscopy.c:614         ]: Thu Sep 14 13:15:24 2017 E_SC0250_COPY_OUT_OF_SEQUENCE A COPY data block was received when one was not expected (or not received when expected).

I have managed to show that this is a weird-ass timeout. The job's pause has now breached 15 minutes (where have I seen that number before?). The really curious thing is that the connection is perfectly fine with anything other than a copy.

I'm working on a test case at the moment, but as it relies on a very large data file it's a bit of a nuisance.
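
For anyone wanting to reproduce this, a data file in the shape the copy statement expects can be generated along the following lines. This is a minimal sketch in Python, not the generator used for the original job: the function name write_copy_file, the seed parameter and the output path are all made up for illustration. It assumes only what the copy statement shows, namely three integer fields per row, the first two ended by a tab (c0tab) and the last by a newline (c0nl).

```python
import random


def write_copy_file(path, rows, seed=0):
    """Write `rows` lines of three tab-separated random integers to `path`."""
    rng = random.Random(seed)  # seeded so the same file can be regenerated
    limit = 2**31 - 1          # largest value a signed 4-byte integer holds
    with open(path, "w", newline="\n") as out:
        for _ in range(rows):
            a, b, c = (rng.randint(0, limit) for _ in range(3))
            # A tab ends fields a and b (c0tab); a newline ends field c (c0nl).
            out.write(f"{a}\t{b}\t{c}\n")
```

A call such as write_copy_file("copy_test.dat", 1000) gives a small file for a smoke test. The row count would need raising to the sizes that matter here, and a file of that size takes a while to write and a lot of disk.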

Anyone seen anything like this before?

Marty