What <defunct> Is Wrong?

In this post I’m going to cover:

  • What is xxnetwork-node <defunct>?
  • What should you do about it?
  • Some notes about logs.

NOTE: I have shortened some code and commands in the examples just to make it easier to digest and the code or commands on your computer may not be the same. I’m only going to discuss the node process since gateway is essentially the same.

What is a defunct process?

A process is marked defunct when a child process ends and the parent process continues to run. When the xxnetwork-node service starts, processes are started by other processes.

What are the Parent/Child relationships of the xxnetwork software?

tl;dr: systemd runs bash, bash runs the wrapper script, the wrapper script runs the binary

Here’s a brief walk-through of the process. When I log in to node06 I can get a list of the processes being run by the user ubuntu.

Before starting the xxnetwork software

[email protected]:~$ ps -u ubuntu
  PID TTY          TIME CMD
14634 ?        00:00:00 systemd
14635 ?        00:00:00 (sd-pam)
14702 ?        00:00:00 sshd
19349 ?        00:00:00 sshd
19350 pts/0    00:00:00 bash
26794 pts/0    00:00:00 ps

Then I start the node software manually with $ sudo systemctl start xxnetwork-node

After starting the xxnetwork software

[email protected]:~$ ps -u ubuntu
  PID TTY          TIME CMD
14634 ?        00:00:00 systemd
14635 ?        00:00:00 (sd-pam)
14702 ?        00:00:00 sshd
19349 ?        00:00:00 sshd
19350 pts/0    00:00:00 bash
26807 ?        00:00:00 bash			< new instance of /bin/bash
26816 ?        00:00:00 python3			< new instance of /opt/xxnetwork/xxnetwork-wrapper.py
26932 ?        00:00:01 xxnetwork-node	< new instance of /opt/xxnetwork/bin/xxnetwork-node
27039 pts/0    00:00:00 ps

Three new processes have started; bash, python3, and xxnetwork-node.

systemctl runs the service script, /opt/xxnetwork/xxnetwork-node.service, which is a symlink in /etc/systemd/system/.

$ sudo systemctl start xxnetwork-node

Then xxnetwork-wrapper.py is run by /bin/bash. Here the relevant code from the xxnetwork-node.service file.

ExecStart=/bin/bash -c ' '

What is actually within the single quotes of the bash command shown above is what runs the xxnetwork-node binary. I’ve removed everything except the binary flag for simplicity.

ExecStart=/bin/bash -c '/opt/xxnetwork/xxnetwork-wrapper.py --binary /opt/xxnetwork/bin/xxnetwork-node'

Why is node defunct?

When you see <defunct> it means the xxnetwork-node process, /opt/xxnetwork/bin/xxnetwork-node, has stopped running for some reason. The most common reason is due to a round error. A round error is considered a “Recoverable Error”. It is actually a warning, WARN, not an error but the terms are often used interchangeably.

When there is an error the xxnetwork-node binary creates the file /opt/xxnetwork/node-logs/node-err.log. The wrapper script is written to watch for node-err.log. When the wrapper script sees this file it will stop and start the xxnetwork-node binary, then delete the node-err.log file.

In the time that it takes to stop and start /opt/xxnetwork/bin/xxnetwork-node the OS marks the xxnetwork-node <defunct>. I’ve created a short video that shows what happens on the NodeLab network when a round error occurs.

  • The top left terminal show the xxnetwork-node process and PIDs of all 10 nodes in the NodeLab.
  • The top right terminal shows the log of the permissioning server.
  • The two bottom terminals are of node06.
    • I use the bottom left terminal to stop the xxnetwork processes on node06.
    • The bottom right terminal shows the node.log from node06.

What just happened?

Mid-round, I stopped the xxnetwork services on node06 causing a round error, which is a recoverable error. Two other nodes that were teamed with node06 switched to <defunct>. Their bash and python3 processes were still running but not xxnetwork-node, it was being restarted so the OS considered the process <defunct>.

For a short time three nodes were not available. Two nodes recovered via the wrapper script while node06 was still offline because I had fully stopped the xxnetwork processes.

But it’s not fully recovered yet because node06 has the metric data that the permissioning server needs to put into it’s database. Once node06 came back online the permissioning server reported another error across the team but because all three nodes had the wrapper script running, the xxnetwork-node process on all nodes were restarted and no longer <defunct>. It usually takes the wrapper script a few tries to get everything back to normal.

What to do if node is marked <defunct>?

The first thing to do is to watch the <defunct> process for a little while. The video shows the nodes recovered in a matter of seconds but that’s because in the NodeLab there is very little latency on a 1Gb network. In the wild it will take longer. Give it 3-5 minutes.

$ watch -n2 'ps -A | grep xxnet'

What to do if it stays <defunct>?

Check your logs before restarting the services.

For those of you who aren’t able to send or copy logs to us, don’t worry. Your logs are uploaded to AWS. But if you report an error you must tell us your Node ID so we know which logs to check. Email or DM one of us and say, “Can you check out the logs for [Your Node ID]?”

For those of you who can send or copy logs, you can just DM one of us the whole thing but the best way is to get the useful information with less.

less /opt/xxnetwork/node-logs/node.log

Once the log is open press Shift+ g. That will bring you to the very end of the log file. From there scroll or page up to the useful log information. Copy what you need and to exit less just press the q button.

What is useful log information?

If your node has suffered from an error the wrapper script cannot recover from the log will end with backtrace information. The backtrace is all the lines that start with gitlab.com you see below. It shows the file and line number of the source code that is in question. To get useful information you need to look just before the start of the backtrace. The first line you want to start copying from will begin with PANIC, FATAL, or ERROR and will begin with a date and time stamp. See the example log below.

INFO 2020/07/28 13:51:44 Updating to ERROR
FATAL 2020/07/28 13:51:44 Error encountered - closing server & writing error to /opt/xxnetwork/node-logs/node-err.log: Received error polling for permisioning: Issue polling permissioning: rpc error: code = Unknown desc = Round has failed, state cannot be updated
gitlab.com/elixxir/comms/node.(*Comms).SendPoll.func1
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:62
gitlab.com/xx_network/comms/connect.(*Host).send
	/root/go/pkg/mod/gitlab.com/xx_network/[email protected]/connect/host.go:206
gitlab.com/xx_network/comms/connect.(*ProtoComms).Send
	/root/go/pkg/mod/gitlab.com/xx_network/[email protected]/connect/comms.go:285
gitlab.com/elixxir/comms/node.(*Comms).SendPoll
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:69
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:175
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:84
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:197
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357
Failed to send
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:177
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:84
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:197
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357
panic: Error encountered - closing server & writing error to /opt/xxnetwork/node-logs/node-err.log: Received error polling for permisioning: Issue polling permissioning: rpc error: code = Unknown desc = Round has failed, state cannot be updated
gitlab.com/elixxir/comms/node.(*Comms).SendPoll.func1
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:62
gitlab.com/xx_network/comms/connect.(*Host).send
	/root/go/pkg/mod/gitlab.com/xx_network/[email protected]/connect/host.go:206
gitlab.com/xx_network/comms/connect.(*ProtoComms).Send
	/root/go/pkg/mod/gitlab.com/xx_network/[email protected]/connect/comms.go:285
gitlab.com/elixxir/comms/node.(*Comms).SendPoll
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:69
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:175
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:84
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:197
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357
Failed to send
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:177
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:84
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:197
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357

goroutine 68 [running]:
log.(*Logger).Panic(0xc000308690, 0xc0005c5738, 0x1, 0x1)
	/usr/lib/go-1.13/src/log/log.go:212 +0xaa
gitlab.com/elixxir/server/internal.CreateServerInstance.func1(0xc0004cd100, 0x653)
	/builds/elixxir/server/internal/instance.go:126 +0x7f
gitlab.com/elixxir/server/node.Error(0xc0001ff040, 0xb38bd4, 0x5ec301)
	/builds/elixxir/server/node/changeHandlers.go:424 +0x45c
gitlab.com/elixxir/server/cmd.StartServer.func8(0xc000000001, 0xc0005c58a4, 0x1)
	/builds/elixxir/server/cmd/node.go:122 +0x2d
gitlab.com/elixxir/server/internal/state.Machine.stateChange(0xc0002e58dc, 0xc0002f2920, 0xc000138d40, 0xda89b0, 0xc000138d50, 0xda89b8, 0xc000138d60, 0xda89c0, 0xc000138d70, 0xda89c8, ...)
	/builds/elixxir/server/internal/state/state.go:400 +0x7d
gitlab.com/elixxir/server/internal/state.Machine.Update(0xc0002e58dc, 0xc0002f2920, 0xc000138d40, 0xda89b0, 0xc000138d50, 0xda89b8, 0xc000138d60, 0xda89c0, 0xc000138d70, 0xda89c8, ...)
	/builds/elixxir/server/internal/state/state.go:204 +0x1ad
gitlab.com/elixxir/server/internal.(*Instance).reportFailure(0xc0001ff040, 0xc000416180)
	/builds/elixxir/server/internal/instance.go:640 +0x3e1
gitlab.com/elixxir/server/internal.(*Instance).ReportRoundFailure(0xc0001ff040, 0xe96da0, 0xc0001c4260, 0xc000039650, 0x0)
	/builds/elixxir/server/internal/instance.go:603 +0x257
gitlab.com/elixxir/server/internal.(*Instance).ReportNodeFailure(...)
	/builds/elixxir/server/internal/instance.go:572
gitlab.com/elixxir/server/node.NotStarted.func1(0xc0001ff040)
	/builds/elixxir/server/node/changeHandlers.go:201 +0x4b7
created by gitlab.com/elixxir/server/node.NotStarted
	/builds/elixxir/server/node/changeHandlers.go:169 +0x69a

I hope this post helps some of you out. We’ve tried to make being a BetaNet node operator as easy as possible with the least amount of operator intervention. It hasn’t always worked out that way but we are making improvements to the wrapper script and log collection.

Thank you all very much for running a node during BetaNet!

16 Likes

Very useful to know. Thanks Keith :+1:

1 Like

nice to know, thx!

2 Likes

What about Gateway ??

1 Like

Very helpful thanks Keith

I’m happy this was here, as I got this exact error. I checked the log and it seems there was something wrong with the password. Unfortunately couldn’t fix it myself, but at least this gave me some direction. :ok_hand:

Hi guys, could you please check my node?

WARN 2021/04/18 22:24:26 Poll of permissioning failed, will try again in 1s: The NDF was not returned, 'permissioning is likely in the process of vetting the no…

https://dashboard.xx.network/nodes/tMmB3kJ4prrRZvMGqpq_MBRnWwJO3JTYUDFLrFvF22EC

something seems to be wrong here…

boths ports are open, services are green

The WARN message is expected. The following FATAL message is why your node is not connecting.

You log shows the following message:
FATAL 2021/04/19 00:03:01 Failed to start server: Unable to run instance: unable to contact local address:

In the node.yaml file you should find a flag that is commented out

#overrideInternalIP: “0.0.0.1”

Remove the preceding # and change the ip address to “0.0.0.0” like so…

overrideInternalIP: “0.0.0.0”

Restart the node process with:

$ sudo systemctl restart xxnetwork-node

I checked:
overrideInternalIP: “127.0.0.1”

… same thing

@Denim I’m not seeing a gateway log so there’s a few things going on that need to be checked. Can you DM me on our Discord channel? My ID is Keith aka LordVetinari#9085