BetaNet Update 9/18/2020

Hi!

A few weeks ago, Discord member and node operator Alex Dupre pointed out a series of flaws in the communications code and proposed improvements to address them. These bugs turned out to be responsible for a large portion of the network failures, and we immediately looked into fixing them. Using Alex’s work as a template, we re-architected the comms code to be simpler and easier to maintain while including his fixes. This code has been tested on multiple nodes and shows a roughly ⅓ decrease in failure rate for well-performing nodes. Because the fix only stops failures caused by the node running it, we believe that once it is deployed across the network it will almost entirely eliminate network failures.

xx_network/comms

MR: https://gitlab.com/xx_network/comms/-/merge_requests/26/diffs

ChangeLog:

  • /token
    • Token handling was moved to its own package
    • The manager has been rewritten and is now a map with a sync.RWMutex instead of a sync.Map, because the map with a mutex is faster for this specific use case
    • A new live token object was added, which is how tokens are stored on hosts. It uses a mutex to ensure thread safety within the structure
    • Tokens are now defined as a 32-byte array so they can be used as keys in the manager
    • Package fully tested
  • auth.go
    • Updated for changes to hosts and tokens
  • transmit.go
    • The sending and connection code has been completely rewritten to be goto-free and significantly simpler. It is now generic across normal transmission and streaming communications.
    • A new send lock ensures the connection process runs in only one thread at a time.
    • The process fingerprints the connection when it starts, so that if communications fail while another send is already handling the connection, it does not interrupt that send
  • host.go
    • Hosts have been updated for the new tokens, and their constructors now take a more flexible parameterization interface
  • manager.go
    • The host manager has been rewritten to use a map and a sync.RWMutex instead of a sync.Map, because the map with a mutex is faster for this specific use case
    • Adding hosts does not overwrite a host if it already exists
    • Added the ability to explicitly remove hosts
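For anyone curious what the host manager changes look like in practice, here is a minimal sketch of the pattern described above: a plain map guarded by a sync.RWMutex, an add that does not overwrite an existing entry, and an explicit remove. The names Host, Manager, NewManager, GetHost, AddHost and RemoveHost are assumptions made for illustration and are not taken from the actual xx_network/comms code.

```go
package main

import (
	"fmt"
	"sync"
)

// Host is a hypothetical stand-in for a comms host.
type Host struct {
	ID      string
	Address string
}

// Manager guards a plain map with a sync.RWMutex instead of using sync.Map.
type Manager struct {
	mux   sync.RWMutex
	hosts map[string]*Host
}

// NewManager returns an empty manager ready for use.
func NewManager() *Manager {
	return &Manager{hosts: make(map[string]*Host)}
}

// GetHost takes only the read lock, so concurrent lookups do not block each other.
func (m *Manager) GetHost(id string) (*Host, bool) {
	m.mux.RLock()
	defer m.mux.RUnlock()
	h, ok := m.hosts[id]
	return h, ok
}

// AddHost stores a new host only if the ID is unused. It returns the host
// that is in the map afterwards and whether a new entry was created.
func (m *Manager) AddHost(id, address string) (*Host, bool) {
	m.mux.Lock()
	defer m.mux.Unlock()
	if existing, ok := m.hosts[id]; ok {
		return existing, false
	}
	h := &Host{ID: id, Address: address}
	m.hosts[id] = h
	return h, true
}

// RemoveHost explicitly deletes a host and reports whether it was present.
func (m *Manager) RemoveHost(id string) bool {
	m.mux.Lock()
	defer m.mux.Unlock()
	_, ok := m.hosts[id]
	delete(m.hosts, id)
	return ok
}

func main() {
	m := NewManager()
	m.AddHost("node-1", "10.0.0.1:11420")
	// The second add does not overwrite the first entry.
	h, added := m.AddHost("node-1", "10.0.0.2:11420")
	fmt.Println(h.Address, added) // prints "10.0.0.1:11420 false"
	fmt.Println(m.RemoveHost("node-1")) // prints "true"
}
```

A read-mostly workload such as host lookups is the usual case where a map with an RWMutex is a reasonable choice, since readers only contend with the occasional writer.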

Further work: it would be nice to restructure much of the comms functionality to operate on hosts, and to move the host and its manager into their own subpackage. The circuit object should move to the network package of Elixxir.

elixxir/comms

MR: https://gitlab.com/elixxir/comms/-/merge_requests/289

ChangeLog:

  • Internal Dependencies Updated
  • A large number of suppressed error returns on endpoints were unsuppressed so the sender sees the errors and can handle them
  • Fixed streaming comms, which packed authentication tokens before renegotiating authentication; tokens are now packed after renegotiation
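To illustrate why the ordering in the streaming fix matters, here is a rough sketch, assuming a token is a 32-byte array held in a mutex-protected "live" object as described in the xx_network/comms changelog. The names Token, Live, renegotiate and packStream are hypothetical and only stand in for the real handshake and packing code. If the token were read before renegotiation, the stream would carry the stale token.

```go
package main

import (
	"fmt"
	"sync"
)

// Token is a fixed-size array, so it is comparable and usable as a map key.
type Token [32]byte

// Live holds the current token behind a mutex (hypothetical sketch).
type Live struct {
	mux   sync.RWMutex
	token Token
}

// Set replaces the stored token.
func (l *Live) Set(t Token) {
	l.mux.Lock()
	defer l.mux.Unlock()
	l.token = t
}

// Get returns a copy of the stored token.
func (l *Live) Get() Token {
	l.mux.RLock()
	defer l.mux.RUnlock()
	return l.token
}

// renegotiate is a stand-in for the authentication handshake, which may
// replace the token stored on the host.
func renegotiate(l *Live, fresh Token) {
	l.Set(fresh)
}

// packStream renegotiates first and only then reads the token to pack,
// so the packed token is always the one the other side expects.
func packStream(l *Live, fresh Token) Token {
	renegotiate(l, fresh)
	return l.Get()
}

func main() {
	stale, fresh := Token{1}, Token{2}
	l := &Live{}
	l.Set(stale)
	packed := packStream(l, fresh)
	fmt.Println(packed == fresh) // prints "true"
}
```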

elixxir/server

MR: https://gitlab.com/elixxir/server/-/merge_requests/588/diffs

Version: 1.5.0

ChangeLog:

  • Fixed a crash when the node is coming back online as the permissioning server crashes
  • Internal Dependencies Updated
  • Updated for changes to hosts
  • Ensured passwords and registration codes are not printed to the logs

elixxir/gateway

MR: https://gitlab.com/elixxir/gateway/-/merge_requests/167

Version: 1.5.0

ChangeLog:

  • Internal Dependencies Updated
  • Updated for changes to hosts

Deployment:

This update will go out automatically to all nodes accepting automatic updates on 9/18/2020. It does not break compatibility, but will decrease the failure rate. Because it is fully backwards compatible, we are releasing the update immediately. Nodes that do not install the update may find themselves disabled due to higher failure rates.

Next Steps:

The Elixxir team is hard at work on getting the client functional with the network. We hope to have more news soon! We are also working with the Praxxis team on integrating cMix with xx consensus.

Awesome contributions, Alex Dupre! :100:

Excellent improvement! I’m seeing almost 100% success rate now! :ok_hand:

Great!

Nice job! :partying_face:

Super great!

Good stuff, success rate increasing significantly!

One observation to clarify: the gateway incoming/outgoing traffic volume increased 2-3 times after the update.

Nothing worrying about this, just curious if this is expected behaviour.

:clap: :clap: :clap: :tada:

Double TPS -> Double Traffic

Thanks, that would be logical. I did not check the TPS stats when posting, my bad.
After the last update the traffic decreased because fewer logs were being sent, so I thought it was the opposite case this time around, hence the question.