[Info-ingres] Ingres 9.0.4 nameserver freezing (but the rest of Ingres is fine, and restarting the nameserver manually recovers operation)...

Paul White paul.white at shift7solutions.com.au
Sat Oct 28 06:07:06 UTC 2017


Hi Michel,

Do you know how many new connections are being made to the DB server every minute/second?

 

netstat -n > n1
sleep 10
netstat -n > n2
wc n1 n2
diff n1 n2
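
The same snapshot comparison can be scripted to give a rough connections-per-second figure. This is a minimal Python sketch, assuming the Linux `netstat -n` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function names are illustrative only, and Windows netstat output uses different columns and would need a different parser.

```python
def tcp_connections(netstat_text):
    """Return the set of (local, foreign) address pairs found on TCP lines."""
    pairs = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed Linux layout: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            pairs.add((fields[3], fields[4]))
    return pairs


def new_connections_per_second(before_text, after_text, interval_s):
    """Connections seen in the second snapshot but not the first, per second."""
    new = tcp_connections(after_text) - tcp_connections(before_text)
    return len(new) / interval_s
```

For the n1 and n2 files written by the commands above, read both files and call new_connections_per_second(n1_text, n2_text, 10). The figure is a lower bound: a connection that opens and disappears completely between the two snapshots is missed, although sockets still sitting in TIME_WAIT are counted.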

 

Under Windows, I have recorded an occasional iigcn.exe crash on several Windows servers.

ingstart -iigcn fixes the problem (but don't log out of that Windows session, hey)

 

My theory is that a badly behaved .NET application is making and dropping connections about 50 times per second over a short period.

It leaves thousands of connections standing in TIME_WAIT for about a minute.

 

Paul

From: info-ingres-bounces at lists.planetingres.org [mailto:info-ingres-bounces at lists.planetingres.org] On Behalf Of Michel Forget
Sent: Saturday, 28 October 2017 1:57 PM
To: info-ingres at lists.planetingres.org
Subject: [Info-ingres] Ingres 9.0.4 nameserver freezing (but the rest of Ingres is fine, and restarting the nameserver manually recovers operation)...

 

Hi,

I was referred to this mailing-list in hopes of finding a solution for a problem that has been vexing me.

I have an Ingres 9.0.4  (community edition, 64-bit) installation that is experiencing a strange "freeze".  The symptom observed by the users is that existing connections to the DBMS are just fine, but no new connections to the DBMS can be established (either from elsewhere on the network through Ingres/NET, or even locally).

After some trial and error, and some diagnostic help from Actian, we've determined that the problem isn't actually the DBMS itself.  If we kill -9 the iigcn process and then run ingstart -iigcn, everything returns to normal.  The funny thing is that there is no information in the errlog.log file that indicates a problem at all.

Using this approach, we've been able to avoid downtime (to keep the users happy) but we aren't really closer to a solution.  I -could- write a script that runs once a minute, checks if it can establish a connection to the DBMS, then stops/starts the iigcn process if it can't, but I'm hoping to reach an actual solution to the problem so that such a thing isn't necessary.
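
For reference, the stopgap script described above might look roughly like the following Python sketch. It is only an outline under stated assumptions: the probe command is left to the caller (any command that exits non-zero or hangs when a new connection cannot be made), the 20-second timeout and the two attempts are arbitrary choices, `pkill` is assumed to be available, and the function names are hypothetical. The restart step is the same kill -9 / ingstart -iigcn sequence described above.

```python
import subprocess


def probe_ok(cmd, timeout_s=20):
    """True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung probe or a missing command both count as a failure.
        return False
    return result.returncode == 0


def restart_name_server():
    """Manual recovery from this thread: kill -9 iigcn, then ingstart -iigcn."""
    # Assumption: pkill exists and only one installation's iigcn is running.
    subprocess.run(["pkill", "-9", "-x", "iigcn"], check=False)
    subprocess.run(["ingstart", "-iigcn"], check=True)


def check_and_recover(probe, restart, attempts=2):
    """Call restart() only if probe() fails `attempts` times in a row."""
    for _ in range(attempts):
        if probe():
            return "ok"
    restart()
    return "restarted"
```

Run once a minute from cron as the ingres user with the Ingres environment loaded, a call such as check_and_recover(lambda: probe_ok(["/home/ingres/bin/probe_connect.sh"]), restart_name_server) would do the check, where probe_connect.sh is a hypothetical script that attempts a fresh connection. Requiring two failed probes before restarting avoids bouncing the name server on a single slow response.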

The operating system that Ingres is installed on (which is running as a virtual machine in the Azure cloud) is:

CentOS Linux release 7.3.1611 (Core)

The configuration of the nameserver is currently the default configuration, except that session_limit is set to 64 instead of 16.  While this virtual machine running in the Azure cloud is new, the configuration of Ingres was migrated from an older system.  I believe this problem happened once or twice over a two year period on the old system, but it was never happening every day like this is.  I can't see session_limit being a problem, but just the same I've reverted it back to 16.  My nameserver configuration is below:

Name Server Parameters:

Name                │Value               │Units
check_interval      │300                 │seconds
check_timeout       │20                  │seconds
check_type          │connect,install,clas│string
compress_point      │50                  │integer
default_server_class│INGRES              │string
expire_interval     │300                 │seconds
local_vnode         │REDACTED            │
mechanisms          │                    │mechanism list
registry_type       │none                │installation type
remote_mechanism    │none                │none, default, mechanism name
remote_vnode        │                    │virtual node
session_limit       │16                  │sessions
ticket_cache_size   │10                  │tickets
ticket_expire       │1800                │seconds
timeout             │60                  │seconds


In terms of the operating system itself, the /etc/sysctl.conf file consists of only one entry:

kernel.shmmax = 300000000

The Actian representative I spoke with had me gather some information (of which #3 looks interesting diagnostically speaking):

1) the output of pstack against the iigcn process:

#0  0x00007f3d74461de0 in __poll_nocancel () from /lib64/libc.so.6
#1  0x000000000043d093 in ii_CL_poll_poll ()
#2  0x000000000043c692 in iiCLpoll ()
#3  0x000000000043810a in GCexec ()
#4  0x000000000040ec21 in main ()


2) the output of lsof -p against the iigcn process:

COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF      NODE NAME
iigcn   18868 ingres  cwd    DIR      8,1     4096 100760268 /home/ingres
iigcn   18868 ingres  rtd    DIR      8,1     4096       128 /
iigcn   18868 ingres  txt    REG      8,1   601827 101588538 /opt/Ingres/IngresII/ingres/bin/iigcn
iigcn   18868 ingres  mem    REG      8,1    62184 101735988 /usr/lib64/libnss_files-2.17.so
iigcn   18868 ingres  mem    REG      8,1    11384 100664811 /usr/lib64/libfreebl3.so
iigcn   18868 ingres  mem    REG      8,1    88720 100760266 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
iigcn   18868 ingres  mem    REG      8,1    19776 101735975 /usr/lib64/libdl-2.17.so
iigcn   18868 ingres  mem    REG      8,1    41080 100760262 /usr/lib64/libcrypt-2.17.so
iigcn   18868 ingres  mem    REG      8,1  2118128 100760258 /usr/lib64/libc-2.17.so
iigcn   18868 ingres  mem    REG      8,1  1141928 101735977 /usr/lib64/libm-2.17.so
iigcn   18868 ingres  mem    REG      8,1   143944 101812385 /usr/lib64/libpthread-2.17.so
iigcn   18868 ingres  mem    REG      8,1   155464 100749243 /usr/lib64/ld-2.17.so
iigcn   18868 ingres    0u   CHR      1,3      0t0      1028 /dev/null
iigcn   18868 ingres    1w   CHR      1,3      0t0      1028 /dev/null
iigcn   18868 ingres    2w   CHR      1,3      0t0      1028 /dev/null
iigcn   18868 ingres    3u  IPv4 58898831      0t0       TCP *:36750 (LISTEN)
iigcn   18868 ingres    4r   REG      8,1  2121728  33702776 /opt/Ingres/IngresII/ingres/files/english/slow_v4.mnx
iigcn   18868 ingres    5w   REG      8,1 70345162  69239389 /opt/Ingres/IngresII/ingres/files/errlog.log
iigcn   18868 ingres    6u  IPv4 58973717      0t0       TCP localhost:55778->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres    7u  IPv4 58975824      0t0       TCP localhost:57130->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres    8u  IPv4 58981059      0t0       TCP localhost:58514->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres    9u  IPv4 58987208      0t0       TCP localhost:59902->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   10u  IPv4 58992237      0t0       TCP localhost:33066->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   11u  IPv4 58997496      0t0       TCP localhost:34474->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   12u  IPv4 59006999      0t0       TCP localhost:35826->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   13u  IPv4 59009214      0t0       TCP localhost:37182->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   14u  IPv4 59013545      0t0       TCP localhost:38628->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   15u  IPv4 59017155      0t0       TCP localhost:40132->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   16u  IPv4 59023532      0t0       TCP localhost:41520->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   17u  IPv4 59022321      0t0       TCP localhost:42880->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   18u  IPv4 59035815      0t0       TCP localhost:44406->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   19u  IPv4 59037451      0t0       TCP localhost:45748->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   20u  IPv4 59044416      0t0       TCP localhost:47218->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   21u  IPv4 60055611      0t0       TCP localhost:48546->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   22u  IPv4 60059812      0t0       TCP localhost:49844->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   23u  IPv4 60059483      0t0       TCP localhost:51224->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   24u  IPv4 60071073      0t0       TCP localhost:52632->localhost:34546 (ESTABLISHED)
iigcn   18868 ingres   25u  IPv4 61921071      0t0       TCP localhost:36750->localhost:43066 (ESTABLISHED)


3) the output of netstat | grep <gcn port> (which was 36750 at the time)

tcp        0      0 localhost:36750         localhost:45424         SYN_RECV
tcp        0      0 localhost:36750         localhost:45488         SYN_RECV
tcp        0      0 localhost:36750         localhost:45354         SYN_RECV
tcp        0      0 localhost:36750         localhost:45450         SYN_RECV
tcp        0      0 localhost:36750         localhost:45264         SYN_RECV
tcp        0      0 localhost:36750         localhost:45278         SYN_RECV
tcp        0      0 localhost:36750         localhost:45262         SYN_RECV
tcp        0      0 localhost:36750         localhost:45408         SYN_RECV
tcp        0      0 localhost:36750         localhost:45390         SYN_RECV
tcp        0      0 localhost:36750         localhost:45338         SYN_RECV
tcp        0      0 localhost:36750         localhost:45356         SYN_RECV
tcp        0      0 localhost:36750         localhost:45410         SYN_RECV
tcp        0      0 localhost:36750         localhost:45352         SYN_RECV
tcp        0      0 localhost:43128         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45424         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43614         localhost:36750         ESTABLISHED
tcp      258      0 localhost:36750         localhost:43128         ESTABLISHED
tcp        0      0 localhost:43066         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45450         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43542         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45262         localhost:36750         ESTABLISHED
tcp      258      0 localhost:36750         localhost:43614         ESTABLISHED
tcp        0      0 localhost:43492         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43764         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43726         ESTABLISHED
tcp        0    309 localhost:45236         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43724         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43126         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43366         localhost:36750         ESTABLISHED
tcp      345      0 localhost:36750         localhost:43370         ESTABLISHED
tcp        0    309 localhost:45488         localhost:36750         ESTABLISHED
tcp        0    358 localhost:45108         localhost:36750         ESTABLISHED
tcp        0    258 localhost:45352         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45222         localhost:36750         ESTABLISHED
tcp      258      0 localhost:36750         localhost:43368         ESTABLISHED
tcp      358      0 localhost:36750         localhost:43616         ESTABLISHED
tcp        0      0 localhost:43616         localhost:36750         ESTABLISHED
tcp      358      0 localhost:36750         localhost:43372         ESTABLISHED
tcp        0      0 localhost:43598         localhost:36750         ESTABLISHED
tcp      345      0 localhost:36750         localhost:43612         ESTABLISHED
tcp        0    309 localhost:45408         localhost:36750         ESTABLISHED
tcp        0    358 localhost:45356         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43764         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43084         ESTABLISHED
tcp      345      0 localhost:36750         localhost:43126         ESTABLISHED
tcp        0      0 localhost:43084         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43492         ESTABLISHED
tcp        0    309 localhost:45390         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43218         ESTABLISHED
tcp      358      0 localhost:36750         localhost:43130         ESTABLISHED
tcp      258      0 localhost:36750         localhost:43068         ESTABLISHED
tcp        0    309 localhost:45264         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43068         localhost:36750         ESTABLISHED
tcp        0    345 localhost:45354         localhost:36750         ESTABLISHED
tcp        0      0 localhost:36750         localhost:43066         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43338         ESTABLISHED
tcp        0    258 localhost:45104         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43372         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43338         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43370         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43542         ESTABLISHED
tcp        0    309 localhost:45090         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43368         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43220         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43726         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45144         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43598         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43366         ESTABLISHED
tcp        0      0 localhost:43612         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43724         ESTABLISHED
tcp        0    309 localhost:45410         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45278         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43218         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45184         localhost:36750         ESTABLISHED
tcp        0      0 localhost:43130         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45158         localhost:36750         ESTABLISHED
tcp        0    309 localhost:45338         localhost:36750         ESTABLISHED
tcp      309      0 localhost:36750         localhost:43220         ESTABLISHED
tcp        0    345 localhost:45106         localhost:36750         ESTABLISHED


I don't know enough about TCP/IP networking to say for sure, but this looks like an awful lot of connections on the iigcn port (which should really just be accepting connections, doing its thing, then dropping them).  Also, those connections with a state of SYN_RECV seem like a problem.  I'm not entirely sure how to read this output -- it seems to be saying that there are ~65 active connections going on, with 13 more waiting in the wings.  I could be entirely wrong about that, though.  I'm no networking guru.
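
One way to read output like this is to tally it per side of the connection. Because both endpoints are on localhost, an established connection appears as two lines (one where the gcn port is the local address, one where it is the foreign address), so the raw line count overstates the number of connections. On Linux, Recv-Q on an established socket is data the kernel has received that the application has not yet read, and Send-Q is data not yet acknowledged by the peer. The following Python sketch is illustrative only; the function name and the returned keys are made up for this example, and it assumes the Linux netstat column layout shown above.

```python
from collections import Counter


def summarize_port(netstat_text, port):
    """Tally TCP states and queued data for one port, split by side."""
    suffix = ":" + str(port)
    summary = {
        "server_states": Counter(),  # lines where the port is the local address
        "client_states": Counter(),  # lines where the port is the foreign address
        "server_unread": 0,          # server-side sockets with Recv-Q > 0
        "client_unsent": 0,          # client-side sockets with Send-Q > 0
    }
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed Linux layout: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix):
            summary["server_states"][state] += 1
            if recv_q > 0:
                summary["server_unread"] += 1
        elif foreign.endswith(suffix):
            summary["client_states"][state] += 1
            if send_q > 0:
                summary["client_unsent"] += 1
    return summary
```

A large server_unread count would suggest that requests are reaching the name server's sockets but the process is not reading them, which may be worth comparing against the pstack output showing it sitting in poll.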

By way of example, when the system ISN'T having this problem the output looks more like this (at which time the gcn port is 37550):

tcp        0      0 localhost:52012         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51888         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51892         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51860         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52032         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52036         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51928         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52068         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51924         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52038         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52008         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52006         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51884         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51896         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52030         localhost:37550         TIME_WAIT
tcp        0      0 localhost:51998         localhost:37550         TIME_WAIT
tcp        0      0 localhost:52002         localhost:37550         TIME_WAIT

 

I'm not really sure where to go from here.  Does anyone have any thoughts?  Anything I could try in order to diagnose this further?  As I said above, I can write a program that will test the connection and recover in the event of a failure, but I'd really prefer to simply have a working system and not have to worry about that.  :)


