It took a lot of trial and error but I traced the problem to my `serial_send` function.
The "dead" state of publishers and subscribers was intermittent. Occasionally they were missing, occasionally they were present.
The solution was simply to pass two timeouts to the `Serial()` constructor:
`self.serial_port = serial.Serial(serial_port_path, serial_baud_rate, timeout=0.01, write_timeout=0.01)`
The clue for the solution came from using the program `minicom` to communicate with my Arduino directly. Out of ideas, I tried connecting to the Arduino while the node was active. I had rigged my Arduino code to produce *something* every time it received a message ending in a carriage-return. I noticed that when I spammed the Enter key, and thereby `\r`, I would occasionally see the other expected responses: sensor states, encoder states, motor effort states, etc.
This could only occur if the ROS node was still sending messages. My best guess is that there was a deadlock created by my use of `Serial.read_until()`, which assumed that a carriage-return would *always* be received. With no read timeout set, that call blocks until it sees the terminator or reaches its size limit. I had added a "fallback" limit of 30 characters, hoping it would cover unforeseen issues like this, but that limit does not help when fewer than 30 characters arrive.
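As an illustration of the read side, here is a minimal sketch of a bounded read. It assumes `port` is a `serial.Serial` opened with a `timeout`, as in the constructor call above; the helper name `read_response` and the ASCII decoding are my own assumptions, not the code from the node.

```python
def read_response(port, terminator=b"\r", max_len=30):
    """Read one terminated response, or return None if it never arrived.

    `port` is assumed to be a serial.Serial opened with a read timeout,
    e.g. serial.Serial(path, baud, timeout=0.01, write_timeout=0.01).
    """
    # With a timeout set, read_until() returns whatever has arrived so far
    # instead of blocking until the terminator or max_len bytes show up.
    # Arguments are passed positionally: (terminator, size limit).
    data = port.read_until(terminator, max_len)
    if not data.endswith(terminator):
        # Timed out, or hit max_len, before a complete response was seen.
        return None
    # Strip the terminator and hand back text (hypothetical ASCII protocol).
    return data[: -len(terminator)].decode("ascii", errors="replace")
```

The caller can then treat `None` as "no response this cycle" and carry on, rather than waiting forever on a carriage-return that was lost.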
Based on the code I wrote, I had no real reason to suspect otherwise, and I already knew that my Arduino *did* send a carriage-return for every response. Additionally, the Arduino code had a maximum input buffer of 20 characters, after which it would check its input, and regardless of its response it would *always* terminate that response with a carriage-return.
The real key was adding the `write_timeout=0.01` parameter to the constructor. The rest of my ROS node's Python code was written to log any caught exceptions, and sure enough, "write timeout" exceptions were finally being caught.
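For the write side, a minimal sketch of catching that exception might look like the following. The function name `serial_send_safe`, the logger name, and the trailing `\r` are assumptions for illustration, not my actual `serial_send`; the guarded import is only there so the sketch can be exercised on a machine without pyserial.

```python
import logging

try:
    import serial  # pyserial
    WRITE_TIMEOUT = serial.SerialTimeoutException
except ImportError:
    # Fallback so the sketch can be run without pyserial installed.
    WRITE_TIMEOUT = TimeoutError

log = logging.getLogger("serial_arbiter")


def serial_send_safe(port, message):
    """Write one carriage-return-terminated message; return True on success.

    `port` is assumed to be a serial.Serial opened with write_timeout set.
    """
    try:
        port.write(message.encode("ascii") + b"\r")
        return True
    except WRITE_TIMEOUT as exc:
        # pyserial raises this when write_timeout elapses before the
        # data could be written; log it and let the caller decide.
        log.warning("serial write timed out: %s", exc)
        return False
```

Returning a flag instead of re-raising keeps a single stalled write from taking down the callback that triggered it.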
I only observe these errors very occasionally; I don't want to imply that my code is terribly broken. There is some nuance in the way PySerial works which I might be able to address with a `sleep()` in the `serial_send` function. Currently these timeout exceptions only occur when the `serial_arbiter` node receives an overwhelming number of messages and has a difficult time keeping up with its callbacks. Now, at least, the node has some added resiliency.