What are the requirements for November? I hate having any downtime at all!
What can I do to prevent this from happening? I trust a comforting message from xxnetwork, but I need to be able to verify that myself.
You used just slightly more than 13h of your 180h downtime allowance for November. Relax, and monitor that your node process is always running.
The process was running, that's the whole point.
Likely the wrapper process, not the node process.
The node process exits when it encounters an unexpected error, and the wrapper doesn't restart it in such a case "by design".
Very strange. When checking
sudo systemctl status xxnetwork-node.service
this was the message:
Active: active (running)
That's the wrapper. If you look at your node.log (if you still have it) you'll probably see that for several hours there was no activity.
Is there a smart way to check node.log automatically for specific terms that indicate something is really wrong? There are a lot of messages in between that do not mean the node process has stopped (it may just be waiting for something), yet in the bad case the whole process needs to be restarted.
"What are the requirements for November?"
Re: Requirements - Each month you can find the requirements which need to be met at the bottom of the page to the right. https://xx.network/nodes/run
Re: November Requirements - https://xx.network/nodes/uptime-policy-nov-2020.pdf
"What can I do to prevent this from happening, I trust a comforting message from xxnetwork but I have to control that?"
We are in BetaNet and things don't always work as expected. We have anticipated these kinds of things, which is why we have set the requirements we have. We feel they're reasonable and shouldn't be too much of a burden.
watch -n1 "ps -A | grep xxnet"
Will show you the state of the processes rather than the service. It will switch to <defunct> if there is a round failure, but the services usually do a good job of restarting the processes. If it gets stuck in the <defunct> state for more than a few minutes, that indicates the process has crashed.
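If you want to automate that instead of eyeballing watch, here is a minimal sketch; it assumes a 30-second poll interval and reuses the same "xxnet" pattern as the watch command above:
#!/bin/bash
# Poll every 30s and log an alert once the node process has been
# <defunct> for 3 minutes or more (the threshold is an assumption).
DEFUNCT_FOR=0
while true; do
  if ps -A | grep xxnet | grep -q defunct; then
    DEFUNCT_FOR=$((DEFUNCT_FOR + 30))
  else
    DEFUNCT_FOR=0
  fi
  if [ "$DEFUNCT_FOR" -ge 180 ]; then
    logger -t xx-check "xxnetwork node <defunct> for over 3 minutes, likely crashed"
  fi
  sleep 30
done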
For a better understanding check out https://staging-forum.xx.network/t/what-defunct-is-wrong/1945 and https://staging-forum.xx.network/t/the-xxnetwork-logs-and-useful-tools/1958
You may also want to look into a monitoring solution that another user proposed. https://staging-forum.xx.network/t/monitor-your-node-online-status/2399
Thanks for the hints and the help. As I understand it, the options are searching the logs for FATAL, PANIC and ERROR, apart from checking whether the process is running. Since the node process can stop and be restarted successfully by the wrapper, it is not that easy to just search for stopped states: a confirmed ERROR or FATAL state does not mean the process cannot be restarted, but you do not know for sure. Those states appear frequently in the log. I did not find PANIC so far.
When checking directly you can use watch, but the check needs to run automatically. The idea is to make a script that checks the last 10 lines or so of the log for ERROR or FATAL states, checks again at an interval when such a state occurs, repeats that a couple of times, and after that checks whether the node process is running, because then you can be sure the wrapper script did not restart it.
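Something like this minimal sketch of that idea; the log path and the "xxnetwork-node" process name are assumptions, adjust them to your setup:
#!/bin/bash
# Check the last 10 lines of the log for ERROR/FATAL a few times in a row,
# then only alert if the node process is also gone, i.e. the wrapper did
# not manage to restart it.
LOG=/opt/xxnetwork/node-logs/node.log   # assumed location, adjust to yours
HITS=0
for i in 1 2 3; do
  if tail -n 10 "$LOG" | grep -Eq 'ERROR|FATAL'; then
    HITS=$((HITS + 1))
  fi
  sleep 60
done
# Process name is an assumption; it may also match the wrapper on some setups.
if [ "$HITS" -eq 3 ] && ! pgrep -f xxnetwork-node >/dev/null; then
  logger -t xx-check "repeated ERROR/FATAL and node process not running"
fi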
Does the PANIC state mean the node process cannot be restarted by the wrapper script? So in that case, just restart it manually?
What's the average time it takes to complete the restart process once the wrapper script detects it needs to restart?
It'd probably be easier and more accurate to check whether the log file's modification time is older than 3 minutes.
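A minimal sketch of that check, assuming the log lives at /opt/xxnetwork/node-logs/node.log (adjust the path to your setup):
# Log a warning if node.log has not been modified for over 3 minutes.
LOG=/opt/xxnetwork/node-logs/node.log   # assumed location
if [ -n "$(find "$LOG" -mmin +3)" ]; then
  logger -t xx-check "node.log stale for over 3 minutes"
fi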
Right, easy and clear
So if that's the case, the node process needs to be restarted?
To not be marked as offline, yes. You should also notify the xx team, though, because I suppose that was the rationale behind this "design".
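As a sketch of acting on that (log path assumed; the service name matches the systemctl command quoted earlier), you could run something like this from root's cron every minute:
# Restart the service when the log goes stale, and leave a trace as a
# reminder to notify the xx team.
LOG=/opt/xxnetwork/node-logs/node.log   # assumed location
if [ -n "$(find "$LOG" -mmin +3)" ]; then
  systemctl restart xxnetwork-node.service
  logger -t xx-check "restarted xxnetwork-node.service after stale node.log; report this to the xx team"
fi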