Node 1.3.2 stopped, frequent "Restarting binary due to error"

Node v1.3.2 stopped working on its own. It took me a while to notice, because network traffic continued to look normal. I restarted the node service first and only later realized that the restart wiped /opt/xxnetwork/node-logs/node-err.log (assuming it was ever written, as the log below suggests it should have been).

DEBUG 2020/07/21 10:51:41 Successfully contacted local address!
INFO 2020/07/21 10:51:41 Adding dummy users to registry
INFO 2020/07/21 10:51:41 Waiting on communication from gateway to continue
INFO 2020/07/21 10:51:42 Communication from gateway received
INFO 2020/07/21 10:51:42 Updating to WAITING
INFO 2020/07/21 10:51:44 Updating to ERROR
FATAL 2020/07/21 10:51:44 Error encountered - closing server & writing error to /opt/xxnetwork/node-logs/node-err.log: Received error polling for permisioning: Issue polling permissioning: rpc error: code = Unknown desc = Round has failed, state cannot be updated
gitlab.com/elixxir/comms/node.(*Comms).SendPoll.func1
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:62
gitlab.com/elixxir/comms/connect.(*Host).send
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/connect/host.go:206
gitlab.com/elixxir/comms/connect.(*ProtoComms).Send
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/connect/comms.go:285
gitlab.com/elixxir/comms/node.(*Comms).SendPoll
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:69
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:159
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:69
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:192
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357
Failed to send
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:161
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:69
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:192
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357
panic: Error encountered - closing server & writing error to /opt/xxnetwork/node-logs/node-err.log: Received error polling for permisioning: Issue polling permissioning: rpc error: code = Unknown desc = Round has failed, state cannot be updated
gitlab.com/elixxir/comms/node.(*Comms).SendPoll.func1
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:62
gitlab.com/elixxir/comms/connect.(*Host).send
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/connect/host.go:206
gitlab.com/elixxir/comms/connect.(*ProtoComms).Send
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/connect/comms.go:285
gitlab.com/elixxir/comms/node.(*Comms).SendPoll
	/root/go/pkg/mod/gitlab.com/elixxir/[email protected]/node/register.go:69
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:159
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:69
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:192
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357
Failed to send
gitlab.com/elixxir/server/permissioning.PollPermissioning
	/builds/elixxir/server/permissioning/permissioning.go:161
gitlab.com/elixxir/server/permissioning.Poll
	/builds/elixxir/server/permissioning/permissioning.go:69
gitlab.com/elixxir/server/node.NotStarted.func1
	/builds/elixxir/server/node/changeHandlers.go:192
runtime.goexit
	/usr/lib/go-1.13/src/runtime/asm_amd64.s:1357

goroutine 64 [running]:
log.(*Logger).Panic(0xc0002e8af0, 0xc000259738, 0x1, 0x1)
	/usr/lib/go-1.13/src/log/log.go:212 +0xaa
gitlab.com/elixxir/server/internal.CreateServerInstance.func1(0xc0001eae00, 0x647)
	/builds/elixxir/server/internal/instance.go:114 +0x7f
gitlab.com/elixxir/server/node.Error(0xc000094a80, 0xb38004, 0x5ec301)
	/builds/elixxir/server/node/changeHandlers.go:419 +0x45c
gitlab.com/elixxir/server/cmd.StartServer.func8(0xc000000001, 0xc0002598a4, 0x1)
	/builds/elixxir/server/cmd/node.go:122 +0x2d
gitlab.com/elixxir/server/internal/state.Machine.stateChange(0xc0002cfd18, 0xc0002d15c0, 0xc000255160, 0xda6750, 0xc000255170, 0xda6758, 0xc000255180, 0xda6760, 0xc000255190, 0xda6768, ...)
	/builds/elixxir/server/internal/state/state.go:400 +0x7d
gitlab.com/elixxir/server/internal/state.Machine.Update(0xc0002cfd18, 0xc0002d15c0, 0xc000255160, 0xda6750, 0xc000255170, 0xda6758, 0xc000255180, 0xda6760, 0xc000255190, 0xda6768, ...)
	/builds/elixxir/server/internal/state/state.go:204 +0x1ad
gitlab.com/elixxir/server/internal.(*Instance).reportFailure(0xc000094a80, 0xc000354b40)
	/builds/elixxir/server/internal/instance.go:589 +0x3e1
gitlab.com/elixxir/server/internal.(*Instance).ReportRoundFailure(0xc000094a80, 0xe94440, 0xc0003e4300, 0xc0000399e0, 0x0)
	/builds/elixxir/server/internal/instance.go:552 +0x257
gitlab.com/elixxir/server/internal.(*Instance).ReportNodeFailure(...)
	/builds/elixxir/server/internal/instance.go:521
gitlab.com/elixxir/server/node.NotStarted.func1(0xc000094a80)
	/builds/elixxir/server/node/changeHandlers.go:196 +0x4b7
created by gitlab.com/elixxir/server/node.NotStarted
	/builds/elixxir/server/node/changeHandlers.go:164 +0x698

"Restarting binary due to error" also shows up quite frequently in the wrapper log:

[INFO] 21-Jul-20 10:51:34: Executing command: {'command': 'start', 'nodes': None}
[INFO] 21-Jul-20 10:51:34: /opt/xxnetwork/bin/xxnetwork-node started at PID 23072
[INFO] 21-Jul-20 10:51:34: Completed command: {'command': 'start', 'nodes': None}
[WARNING] 21-Jul-20 10:51:45: Restarting binary due to error...
[INFO] 21-Jul-20 10:51:45: /opt/xxnetwork/bin/xxnetwork-node started at PID 23116
[WARNING] 21-Jul-20 10:55:37: Restarting binary due to error...
[INFO] 21-Jul-20 10:55:37: /opt/xxnetwork/bin/xxnetwork-node started at PID 23348
[WARNING] 21-Jul-20 10:57:56: Restarting binary due to error...
[INFO] 21-Jul-20 10:57:56: /opt/xxnetwork/bin/xxnetwork-node started at PID 23515
[WARNING] 21-Jul-20 11:00:26: Restarting binary due to error...
[INFO] 21-Jul-20 11:00:26: /opt/xxnetwork/bin/xxnetwork-node started at PID 23678
[WARNING] 21-Jul-20 11:00:38: Restarting binary due to error...
[INFO] 21-Jul-20 11:00:38: /opt/xxnetwork/bin/xxnetwork-node started at PID 23721

Any suggestions on how to debug this, or what diagnostic info would be helpful to gather?
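In the meantime, here is the rough throwaway script I have been using to summarize the restarts and pull the most recent FATAL/ERROR lines out of node.log. The paths and timestamp formats are just what my install uses, so treat it as a sketch rather than an official tool:

#!/usr/bin/env python3
# Rough diagnostic sketch, not an official xx network tool.
# Summarizes wrapper-triggered restarts per hour and shows the last few
# FATAL/ERROR lines from node.log so the cause of each restart is visible.
import re
from collections import Counter
from datetime import datetime

WRAPPER_LOG = "/opt/xxnetwork/node-logs/xxnetwork-wrapper.log"  # assumed path on my install
NODE_LOG = "/opt/xxnetwork/node-logs/node.log"                  # assumed path on my install

# Collect the timestamp of every "Restarting binary due to error" line.
restarts = []
with open(WRAPPER_LOG, errors="replace") as f:
    for line in f:
        if "Restarting binary due to error" not in line:
            continue
        m = re.search(r"(\d{2}-\w{3}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if m:
            restarts.append(datetime.strptime(m.group(1), "%d-%b-%y %H:%M:%S"))

print(f"{len(restarts)} restart(s) found in the wrapper log")
per_hour = Counter(t.replace(minute=0, second=0) for t in restarts)
for hour in sorted(per_hour):
    print(f"  {hour:%d-%b-%y %H:00}  {per_hour[hour]} restart(s)")

# Show the last few FATAL/ERROR lines from the node log.
with open(NODE_LOG, errors="replace") as f:
    problems = [line.rstrip() for line in f if line.startswith(("FATAL", "ERROR"))]
for line in problems[-5:]:
    print(line)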

Nothing strange or unexpected; your node is behaving the same as any other. The node-err.log file is not meant for you, it is a temporary file for the xx team; you can see the same message in node.log.

Are the frequent Node service restarts expected as well?

Recent minute marks when Node service was restarted:

01 04 08 14 15 20 27 28 30 34 36 38 45 48 50 55 57 59 04 06 08 10 13 13 19 23 30 31 33 39 44 45 46 48 51 52 58 59 01 04 06 08 10 11 13 15 17 18 19 21 27 29 31 31 33 39 39 41 43 51 51 51 51 51 51 51 51 51 51 51 55 57 00 00 01 01 04 09 13 17 20 24 30 32 34 40 43 48 50 51 54 00 07 12 14 17 20 23 28 31 35 36 37 39 44 49 51 55 55 59 01 02

Yes

Well, after a restart (or crash) I'd expect it to be back in service, but it dropped out for many hours despite being nominally "up" (thanks to the service wrapper script keeping the process alive).

Restarts are very frequent, but if that's normal, fine (as long as the service actually runs). Recent minute marks when the Node service was restarted:

01 04 08 14 15 20 27 28 30 34 36 38 45 48 50 55 57 59 04 06 08 10 13 13 19 23 30 31 33 39 44 45 46 48 51 52 58 59 01 04 06 08 10 11 13 15 17 18 19 21 27 29 31 31 33 39 39 41 43 51 51 51 51 51 51 51 51 51 51 51 55 57 00 00 01 01 04 09 13 17 20 24 30 32 34 40 43 48 50 51 54 00 07 12 14 17 20 23 28 31 35 36 37 39 44 49 51 55 55 59 01 02

The reason I ask is mostly healthchecks. If we look at the service status, network traffic, top output, etc., everything seems normal, but in fact the node was completely down and out. I'd like to avoid being down without knowing it.
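A minimal healthcheck along the lines of the sketch below could flag that situation, since it exits non-zero whenever the wrapper is crash-looping or node.log has gone quiet. Paths, log formats, and thresholds are assumptions for a default install, so adjust to taste before wiring it into any monitoring:

#!/usr/bin/env python3
# Sketch of a healthcheck, not an official tool: exits non-zero when the node
# looks down even though the service wrapper keeps the process "up".
# Paths, log formats, and thresholds are assumptions; adjust for your install.
import re
import sys
from datetime import datetime, timedelta

WRAPPER_LOG = "/opt/xxnetwork/node-logs/xxnetwork-wrapper.log"  # assumed path
NODE_LOG = "/opt/xxnetwork/node-logs/node.log"                  # assumed path
WINDOW = timedelta(minutes=30)
MAX_RESTARTS = 10                     # more than this per WINDOW looks like a crash loop
MAX_SILENCE = timedelta(minutes=10)   # node.log should not go quiet for this long

now = datetime.now()

# 1) Crash-loop check: count wrapper restarts inside the window.
recent_restarts = 0
with open(WRAPPER_LOG, errors="replace") as f:
    for line in f:
        if "Restarting binary due to error" not in line:
            continue
        m = re.search(r"(\d{2}-\w{3}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if m and now - datetime.strptime(m.group(1), "%d-%b-%y %H:%M:%S") <= WINDOW:
            recent_restarts += 1
if recent_restarts > MAX_RESTARTS:
    print(f"UNHEALTHY: {recent_restarts} restarts in the last {WINDOW}")
    sys.exit(1)

# 2) Liveness check: node.log should still be receiving new lines.
last_seen = None
with open(NODE_LOG, errors="replace") as f:
    for line in f:
        m = re.search(r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})", line)
        if m:
            last_seen = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
if last_seen is None or now - last_seen > MAX_SILENCE:
    print(f"UNHEALTHY: no node.log activity since {last_seen}")
    sys.exit(1)

print("OK: node appears to be running normally")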


tail -f xxnetwork-wrapper.log
[WARNING] 20-Jul-20 13:01:44: Log has reached maximum size. Clearing...
[INFO] 20-Jul-20 13:01:44: Log has been cleared. New Size: 0
[WARNING] 20-Jul-20 16:48:24: Log has reached maximum size. Clearing...
[INFO] 20-Jul-20 16:48:24: Log has been cleared. New Size: 0
[WARNING] 20-Jul-20 20:23:43: Log has reached maximum size. Clearing...
[INFO] 20-Jul-20 20:23:43: Log has been cleared. New Size: 0
[WARNING] 20-Jul-20 23:55:57: Log has reached maximum size. Clearing...
[INFO] 20-Jul-20 23:55:57: Log has been cleared. New Size: 0
[WARNING] 21-Jul-20 03:40:38: Log has reached maximum size. Clearing...
[INFO] 21-Jul-20 03:40:38: Log has been cleared. New Size: 0
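So the wrapper apparently caps its own log and truncates it in place once it passes some size limit. Purely as an illustration of what those messages imply, and definitely not the actual xxnetwork-wrapper.py code, the logic would look something like this:

# Purely illustrative guess at what the "Clearing" messages imply; this is
# NOT the actual xxnetwork-wrapper.py implementation.
import logging
import os

MAX_LOG_SIZE = 50 * 1024 * 1024  # assumed cap, e.g. 50 MB

def clear_log_if_too_big(path):
    """Truncate the managed log in place once it grows past MAX_LOG_SIZE."""
    if os.path.getsize(path) < MAX_LOG_SIZE:
        return
    logging.warning("Log has reached maximum size. Clearing...")
    with open(path, "w"):  # opening in write mode truncates the file to zero bytes
        pass
    logging.info("Log has been cleared. New Size: %d", os.path.getsize(path))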

That’s interesting; I don’t have a single “Clearing” entry in my logs (both Node and Gateway, both on v1.3.2).

Check you have the most recent wrapper script.

My xxnetwork-wrapper.py is slightly different (which surprised me, considering I installed 2 days ago and wrapper.py was updated 1 week ago). I replaced the older file, thanks!

I was getting the same “Restarting binary due to error” every few minutes, but thought the Aug 06 update would have updated the wrapper and fixed it, as suggested above. It seems there is still something the node does not like:
[WARNING] 06-Aug-20 20:19:54: Restarting binary due to error...
[INFO] 06-Aug-20 20:19:54: /opt/xxnetwork/bin/xxnetwork-node started at PID 16699
[WARNING] 06-Aug-20 20:22:49: Restarting binary due to error...
[INFO] 06-Aug-20 20:22:49: /opt/xxnetwork/bin/xxnetwork-node started at PID 16924
[WARNING] 06-Aug-20 20:25:20: Restarting binary due to error...
[INFO] 06-Aug-20 20:25:20: /opt/xxnetwork/bin/xxnetwork-node started at PID 17097
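For what it's worth, here is a quick, rough check of how far apart the restarts actually are (wrapper log path assumed), which makes it easy to tell a cadence of minutes from a tight crash loop:

# Rough check, path assumed: print the gap between consecutive wrapper restarts,
# to tell a normal restart cadence (minutes apart) from a tight crash loop.
import re
from datetime import datetime

times = []
with open("/opt/xxnetwork/node-logs/xxnetwork-wrapper.log", errors="replace") as f:
    for line in f:
        if "Restarting binary due to error" not in line:
            continue
        m = re.search(r"(\d{2}-\w{3}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if m:
            times.append(datetime.strptime(m.group(1), "%d-%b-%y %H:%M:%S"))

for earlier, later in zip(times, times[1:]):
    print(f"{later}  (+{later - earlier} since previous restart)")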

This is due to round timeout errors; when a timeout occurs, the node process is restarted. This is the intended behavior.
