BetaNet Update 7/2/2020: Stability improvements

We now have 35 nodes on the network by my count. As many have noticed, the network has been crashing regularly and stability is an issue. We are releasing updates that address the most pressing of these issues.

We are releasing this update urgently, so it will go live today at 3:00PM PST to all those who accept automatic updates. We will also be making the update mandatory due to the number of issues it resolves.

The changelog is as follows:

Primitives

MR: https://gitlab.com/elixxir/primitives/-/merge_requests/102

  • Rewrote the ringbuff package
    • Simpler implementation
    • Better testing of edge cases
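For readers unfamiliar with the data structure, here is a minimal sketch of a fixed-capacity ring buffer that overwrites its oldest entry once full. It is not the code from the merge request: the RingBuffer type, its int payload and the Push/Get/Len methods are all assumptions made for illustration.

```go
package main

// RingBuffer is a hypothetical fixed-capacity buffer of ints. It is an
// illustration only and does not mirror the primitives ringbuff API.
type RingBuffer struct {
	data  []int
	start int // index of the oldest stored element
	count int // number of stored elements
}

// NewRingBuffer allocates a buffer holding at most capacity elements.
func NewRingBuffer(capacity int) *RingBuffer {
	if capacity <= 0 {
		panic("ring buffer capacity must be positive")
	}
	return &RingBuffer{data: make([]int, capacity)}
}

// Push appends v. When the buffer is full, the oldest element is overwritten.
func (r *RingBuffer) Push(v int) {
	if r.count < len(r.data) {
		r.data[(r.start+r.count)%len(r.data)] = v
		r.count++
		return
	}
	r.data[r.start] = v
	r.start = (r.start + 1) % len(r.data)
}

// Get returns the i-th oldest element, or false if i is out of range.
func (r *RingBuffer) Get(i int) (int, bool) {
	if i < 0 || i >= r.count {
		return 0, false
	}
	return r.data[(r.start+i)%len(r.data)], true
}

// Len reports how many elements are currently stored.
func (r *RingBuffer) Len() int { return r.count }
```

The edge cases worth testing in a structure like this are the wrap-around on a full buffer and out-of-range reads, which is presumably what the improved tests target.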

Comms

MR: https://gitlab.com/elixxir/comms/-/merge_requests/245

  • Updated network/dataStructures package for change to ring buffer
  • Added support for overriding the IP address of an arbitrary member of the NDF in network
    • Added overrideIP structure in network/dataStructures
    • Added the structure to the network.instance object, where it is used to overwrite IPs when they are added to the host map
  • Modified connect/comms.go creator function to retry binding to the port if it is in use
  • Improved logging in Connect package
  • Updated internal dependencies
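The IP override described above can be pictured roughly as follows. This is a sketch under assumptions, not the comms implementation: applyOverride, the string node IDs and the choice to keep the original port are all hypothetical.

```go
package main

import "net"

// applyOverride returns the address to store in the host map for node id.
// overrides maps a node ID to a replacement IP. Hypothetical helper: the
// real overrideIP structure in network/dataStructures may differ.
func applyOverride(id, addr string, overrides map[string]string) (string, error) {
	ip, ok := overrides[id]
	if !ok {
		// No override registered for this node, so keep the NDF address.
		return addr, nil
	}
	// Assumption: only the IP is replaced and the port from the NDF is kept.
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(ip, port), nil
}
```

For example, with an override of 10.0.0.5 registered for a node whose NDF address is 203.0.113.7:11420, the host map would receive 10.0.0.5:11420, while every other node keeps its NDF address.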

Wrapper

MR: https://gitlab.com/elixxir/wrapper/-/merge_requests/23/

  • Added the ability to disable logging (contribution from community, h/t to Alex Dupre)
  • Added the ability to configure the location of the logpath more granularly
  • Added a 10 second sleep before restarting the binary

Gateway

Version: 1.3.0

MR: https://gitlab.com/elixxir/gateway/-/merge_requests/134

  • Improved stability on initial connection to the Node
  • Updated internal dependencies

Server

Version: 1.3.0

MR: https://gitlab.com/elixxir/server/-/merge_requests/546

  • Modified the server so that, by default, it overwrites the IP it uses to communicate with itself with its internal IP
    • This can be disabled with the DisableIpOverride config flag
  • Server checks whether it can communicate with itself before beginning operation and reports an error if it cannot
  • Improved handler when a round error is reported to the node
  • Improved stability when the node cannot connect to Permissioning
  • Updated internal dependencies

Next steps
We know of a few less pressing issues which we will try to fix next week. There seems to be a slow memory leak when GPU computation is enabled, and there may be a thread leak. We have also seen the permissioning server’s core scheduling thread lock up on two occasions.

The new wrapper wasn’t released with tonight’s auto update.

It’d be nice to merge this, too, for next update: https://gitlab.com/elixxir/wrapper/-/merge_requests/24

How can I tell whether the server has been updated?

In the cmdlogs I do see bin/server-1.3.0.binary so I guess it has been updated?

$ /opt/xxnetwork/bin/xxnetwork-node version | head -n1
xx network Server v1.3.0 -- 9605bee7 fixed a bad print

We decided not to release the new wrapper because a failed wrapper update can require a lot of work, so we will release it on Monday.