We now have 35 nodes on the network by my count. As many of you have noticed, the network has been crashing regularly and stability is an issue. We are releasing updates that address the most pressing of these issues.
We are releasing this update urgently, so it will go live today at 3:00 PM PST for everyone who accepts automatic updates. We will also be making the update mandatory due to the number of issues it resolves.
The changelog is as follows:
Primitives
MR: https://gitlab.com/elixxir/primitives/-/merge_requests/102
- Rewrote the ringbuff package
- Simpler implementation
- Better testing of edge cases
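For readers unfamiliar with the data structure, here is a minimal sketch of a fixed-capacity ring buffer that overwrites its oldest entry once full. It is an illustration only: the type and method names (RingBuffer, Push, Items) are hypothetical and are not the API of the ringbuff package.

```go
package main

import "fmt"

// RingBuffer is a hypothetical fixed-capacity buffer that overwrites its
// oldest entry once full. It is NOT the API of the ringbuff package.
type RingBuffer struct {
	data  []int
	start int // index of the oldest stored element
	count int // number of stored elements
}

// NewRingBuffer allocates a buffer holding at most capacity elements.
func NewRingBuffer(capacity int) *RingBuffer {
	if capacity <= 0 {
		panic("capacity must be positive")
	}
	return &RingBuffer{data: make([]int, capacity)}
}

// Push stores v, overwriting the oldest element when the buffer is full.
func (r *RingBuffer) Push(v int) {
	end := (r.start + r.count) % len(r.data)
	r.data[end] = v
	if r.count < len(r.data) {
		r.count++
	} else {
		// The slot we just wrote held the oldest element, so advance start.
		r.start = (r.start + 1) % len(r.data)
	}
}

// Items returns the stored elements ordered from oldest to newest.
func (r *RingBuffer) Items() []int {
	out := make([]int, 0, r.count)
	for i := 0; i < r.count; i++ {
		out = append(out, r.data[(r.start+i)%len(r.data)])
	}
	return out
}

func main() {
	rb := NewRingBuffer(3)
	for i := 1; i <= 5; i++ {
		rb.Push(i)
	}
	fmt.Println(rb.Items()) // prints [3 4 5]
}
```

The edge cases worth testing in a structure like this are the empty buffer, the exact moment it becomes full, and wrap-around after several overwrites.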
Comms
MR: https://gitlab.com/elixxir/comms/-/merge_requests/245
- Updated network/dataStructures package for change to ring buffer
- Added support for overriding the IP address of an arbitrary member of the NDF in network
- Added overrideIP structure in network/dataStructures
- Added the structure to the network.instance object; it is used to overwrite IPs when they are added to the host map
- Modified connect/comms.go creator function to retry binding to the port if it is in use
- Improved logging in Connect package
- Updated internal dependencies
Wrapper
MR: https://gitlab.com/elixxir/wrapper/-/merge_requests/23/
- Added the ability to disable logging (contribution from community, h/t to Alex Dupre)
- Added the ability to configure the location of the logpath more granularly
- Added a 10 second sleep before restarting the binary
Gateway
Version: 1.3.0
MR: https://gitlab.com/elixxir/gateway/-/merge_requests/134
- Improved stability on initial connection to the Node
- Updated internal dependencies
Server
Version: 1.3.0
MR: https://gitlab.com/elixxir/server/-/merge_requests/546
- Modified the server so that, by default, it overwrites the IP it uses to communicate with itself with its internal IP
- This can be disabled with the DisableIpOverride config flag
- Server checks if it can communicate with itself before beginning operation and reports an error if it cannot
- Improved handler when a round error is reported to the node
- Improved stability when the node cannot connect to Permissioning
- Updated internal dependencies
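To illustrate the IP override idea, here is a small sketch of an override table that is consulted whenever a host is added to a host map. Everything in it is assumed for illustration: the type names (IPOverrides, HostMap), the node IDs, the addresses, and the disableIPOverride variable, which only models the effect of the DisableIpOverride flag. It is not the overrideIP structure or the network.instance code.

```go
package main

import (
	"fmt"
	"sync"
)

// IPOverrides is a hypothetical table mapping a node ID to a replacement address.
type IPOverrides struct {
	mu   sync.RWMutex
	byID map[string]string
}

func NewIPOverrides() *IPOverrides {
	return &IPOverrides{byID: make(map[string]string)}
}

// Set registers a replacement address for the given node ID.
func (o *IPOverrides) Set(id, addr string) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.byID[id] = addr
}

// Resolve returns the override for id if one exists, else the advertised address.
func (o *IPOverrides) Resolve(id, advertised string) string {
	o.mu.RLock()
	defer o.mu.RUnlock()
	if addr, ok := o.byID[id]; ok {
		return addr
	}
	return advertised
}

// HostMap is a hypothetical host table that applies overrides on insertion.
type HostMap struct {
	overrides *IPOverrides
	hosts     map[string]string
}

func NewHostMap(o *IPOverrides) *HostMap {
	return &HostMap{overrides: o, hosts: make(map[string]string)}
}

// AddHost stores the address for id, replacing it if an override is registered.
func (h *HostMap) AddHost(id, advertised string) {
	h.hosts[id] = h.overrides.Resolve(id, advertised)
}

// Address returns the stored address for id, or "" if the host is unknown.
func (h *HostMap) Address(id string) string {
	return h.hosts[id]
}

func main() {
	disableIPOverride := false // assumption: stands in for the DisableIpOverride flag

	overrides := NewIPOverrides()
	if !disableIPOverride {
		// The server replaces its own advertised address with its internal one.
		overrides.Set("self", "10.0.0.5:11420")
	}

	hosts := NewHostMap(overrides)
	hosts.AddHost("self", "203.0.113.7:11420")  // overridden
	hosts.AddHost("peer", "198.51.100.2:11420") // left as advertised

	fmt.Println("self:", hosts.Address("self"))
	fmt.Println("peer:", hosts.Address("peer"))
}
```

Applying the override at insertion time keeps the rest of the code unaware of it: callers simply look up a host and get whichever address was chosen when it was added.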
Next steps
We know of a few less pressing issues, which we will try to fix next week. There seems to be a slow memory leak when GPU computation is enabled, and there may be a thread leak. We have also seen the permissioning server’s core scheduling thread lock up on two occasions.