silveradept | Adventures in Home Automation #14: No, YOU decided you wanted to restructure your network...

Okay, so, the inciting incident for this is that my venerable and useful routers were declared End of Life by their manufacturer at the end of this year, which meant that the slightly aftermarket firmware that I was using on them would also be discontinued at the end of the year. Cue me trying to figure out whether or not I needed to purchase some replacement hardware, and what kind of network I would want to set up if I did need to do so. Mesh networking seemed like a useful option, given that there were, based on current router placement, a few dead spots in the house or places that got less than optimal signal. But so did deploying some access points and running cable to them from where the router was. And if I was running cable anyway, then I could just move the current router to a better position and see if that took care of the dead zones.

Then I discovered that a different aftermarket firmware supported my routers and was still under active development, so I eventually decided that I'd try to give the new firmware a go and keep the old infrastructure around, with a little bit of re-wiring so that a few new things joined the network switch that was already serving several items in a technology-heavy room. Flashing a completely new firmware, however, meant that all of the accumulated everything that had happened, and some of the ad-hoc decisions that I've made over time to stabilize or assign addresses to various pieces would all be wiped clean, and we could restructure the whole thing from the ground up. My local network expert (as in, the person who has been gently ribbing me about my reacting-to-the-last-crisis approach to networking) asked me to draw up an IP address assignment list and a description of the wireless networks that would be put into play before making any major decisions, so that when the new equipment/firmware went in, we'd be going in with a plan. So I did, and it met with their approval, pending some minor revisions to ensure expansion blocks were available if needed.
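A minimal sketch of what such an assignment list with expansion blocks could look like, using Python's standard `ipaddress` module. The subnet, the group names, and the block size are all assumptions made for illustration; the actual plan is not given in this post. The blocks here are just ranges for organizing static and reserved addresses inside one subnet, not separately routed networks.

```python
import ipaddress

# Hypothetical home subnet; the real one is not stated in the post.
HOME_NET = ipaddress.ip_network("192.168.1.0/24")


def carve_blocks(network, names, new_prefix):
    """Split `network` into equal blocks and label the first ones with `names`.

    Blocks left over are labelled 'reserved-N' so there is room to expand.
    """
    subnets = list(network.subnets(new_prefix=new_prefix))
    if len(names) > len(subnets):
        raise ValueError("more named groups than available blocks")
    plan = {}
    for index, block in enumerate(subnets):
        label = names[index] if index < len(names) else f"reserved-{index}"
        plan[label] = block
    return plan


# Hypothetical group names: eight blocks of 32 addresses, four left in reserve.
plan = carve_blocks(HOME_NET, ["infrastructure", "computers", "iot", "dhcp-pool"], 27)
for label, block in plan.items():
    print(f"{label:15} {block}  ({block[0]} to {block[-1]})")
```

Leaving half the blocks unnamed is one way to keep "expansion blocks available if needed" without renumbering anything later.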

Once we'd come into agreement about what the new structure was going to look like, I waited until my vacation time and first tested the firmware upgrade process on the secondary router. Which helped me develop what the real process was that I needed to ensure a smooth upgrade of the primary router. Then the primary router got the upgrade, and that went smoothly. I went about finding and assigning devices their new IP addresses, which in some cases meant having to change their wireless network identification and re-set them up as if they were new devices. Which then meant futzing around with some developer websites or apps to ensure that the connections went through, after I'd set up all of the networks, including one guest network on each band, separated from the rest by a virtual local area network (VLAN) that was not going to get to see any of the computers on the regular networks that would be handling the Internet of Things traffic and the computers and devices that were not guests in the household. (My work laptop gets put on the guest WiFi, because it is a guest in my house and I don't want it to snoop.) While reworking IP addresses, we also came up with some new names for the non-guest networks, which means that I retired an old CRFH!!! Board joke as a network name in favor of a funnier joke involving Borg high-speed travel.

Then, of course, came the troubleshooting part.

I thought I had set the secondary router up correctly so that it would be assigned the right address by the primary router, but it wasn't showing up on the devices list. Eventually I figured out from my search engine work that I needed to first set the secondary device up so that it would be assigned an address from the DHCP pool, then set a rule that said this particular thing got this address, and then re-set the secondary device to use the address that had now been properly reserved for it. And that bridge is working quite merrily along. It only took a lot of reading troubleshooting threads on forums before someone finally said the magic words that made it all work. Isn't it nice being an information professional?

I also had to make sure that I pointed all of the voice assistants at the right new IP address of Home Assistant to make it all work, and for a moment, I was trying to figure out why I wasn't able to get vocal responses out of the assistant as I had before. Then I realized that there was a part of the configuration that still pointed at the wrong IP address that I hadn't changed over, and once I fixed that, everything was back up and running as smoothly as it had been before. I am beginning to understand the reasons why having good hostnames is helpful when pointing at various resources: if the IP address changes, but the hostname remains the same, then there's no issue if you've put the hostname in.

After troubleshooting and correcting the various things that hadn't gone perfectly according to plan, I discovered that the new firmware exposed significantly less data about itself and its operations to Home Assistant to read than the previous firmware had. I still wanted that data, and searching around showcased two possible solutions, one involving getting lots of data, publishing it to a database and then drawing lots of beautiful graphs in Home Assistant based on the data over time. The other involved MQTT and then establishing REST endpoints that the MQTT server would feed data to. Considering I had already established an MQTT server for the express purpose of turning the TV in the bedroom on and off, I went with the MQTT option.

By doing so, I was introduced to an entirely new suite of tools: Entware, which is apparently software for embedded systems that are unlikely to ever receive kernel upgrades or other fundamental system upgrades. Entware comes from Optware, which probably explains why all of the packages and such install to /opt, and can then be invoked from the router, or at the command line when you've connected to the router. As it is, because the amount of storage on an embedded system like a router is very small, Entware requests a flash drive or similar external storage unit be attached to the embedded system so that it can install itself and all of its packages and materials to the external storage.

I had a spare 2GB flash drive that was looking for usefulness, so I designated it for use in Entware, and that's when the F.U.N. began. It took me a good number of times to figure out how to actually get a secure shell on the router, and it mostly had to do with how a setting I'd changed for logging in to the router through the web administration portal didn't change the user name running on the shell. Once I was trying to log in as the right user, things went more smoothly.

Beyond that, however, the flash drive wasn't getting recognized when I plugged it into the port. I expected the router to work like other systems, where it would recognize the hotplug event and mount the drive appropriately. It didn't. Once I figured that part out, I left it plugged in, and when there weren't active people using bandwidth, rebooted the router. Which did recognize the drive, so that problem was solved.

Except that I'd formatted the drive as a FAT32 device, because that's usually what you format flash drives as for maximum compatibility. Well, FAT as a filesystem doesn't support the links that Linux commands like ln create, and so the Entware installation failed on that. Fine. The router says that it supports ext2/3/4 drives, and those will work with that, so I input the command to reformat the drive, except I'm working in BusyBox rather than the environment I'm more used to, which means I trip up here and there about the right syntaxes to use to make this work. I still manage to muddle my way through getting the device formatted in ext2, and then attempt to convert it to ext3, so that it would have journaling capabilities in addition, but in trying to upgrade the ext2 to ext3, I do something wrong and the file system corrupts, and now the drive won't mount at all on the router.
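For anyone wanting to check a drive before an installer trips over it, here is a small sketch that probes whether a mounted directory accepts symbolic links. The function name is made up for this example, and it assumes a system with Python available (a laptop, say, rather than the router).

```python
import os
import tempfile


def supports_symlinks(directory):
    """Return True if a symbolic link can be created inside `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target")
        link = os.path.join(scratch, "link")
        with open(target, "w") as handle:
            handle.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # FAT-family filesystems typically refuse the link here.
            return False
        return os.path.islink(link)


# Example: probe the system's temporary directory.
print(supports_symlinks(tempfile.gettempdir()))
```

Pointing it at the mount point of a FAT32 flash drive should report False, while an ext2/3/4 mount should report True.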

Ugh, sigh, take the drive out, plug it into my computer, fsck it and find that the files on the drive are still apparently perfectly fine and intact, and therefore, if I can just find a way of sneaking the files off, I can then reformat the drive in ext4 according to the official install instructions, and then presumably put all of the Entware files and the MQTT scripts back on the drive and it will work perfectly fine, since Entware is self-contained and doesn't write anything to the embedded system itself. This rescue plan is complicated somewhat by the fact that plugging the drive into my computer did not do any kind of automounting or making the files immediately available, even after the fsck.

Just because it doesn't automount, however, doesn't mean it can't be mounted! Needed to invoke su to do it and do dangerous things as the superuser, like running the mount command, right off the terminal. And then, copying the material off the mounted drive into my user home directory. And then having to change the permissions and owner on the directory that I'd dumped all of those files into, and all of the files and subdirectories, recursively, so that I wouldn't have to mess with them as the superuser to get them back onto the drive. All in terminal, because the particular way of rescuing these files required the terminal to work. Which succeeded, because apparently I know what I'm doing and/or can look up the right syntaxes for using all of those commands to make them work. Which makes me wonder whether that whole "signs you're becoming an advanced Linux user" thing might be true, and that I might be a right and proper computer toucher who knows both how to get into trouble and out of it. (That's a scary thought.)
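The recursive owner-and-permission change is the step most worth having written down. This is a rough Python equivalent, not the commands actually used: the function name is invented, and handing files to a different user generally needs superuser rights.

```python
import os
import stat


def take_ownership(root, uid, gid):
    """Recursively chown everything under `root` and make it owner-accessible.

    Returns the number of filesystem entries changed.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    changed = 0
    for path in paths:
        # Change the link itself rather than whatever it points at.
        os.chown(path, uid, gid, follow_symlinks=False)
        if not os.path.islink(path):
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            extra = stat.S_IRUSR | stat.S_IWUSR
            if os.path.isdir(path):
                extra |= stat.S_IXUSR  # directories need execute to be entered
            os.chmod(path, mode | extra)
        changed += 1
    return changed


# Example (hypothetical path): take_ownership("/home/me/entware-rescue", os.getuid(), os.getgid())
```

Symbolic links are re-owned but not followed, so a copied Entware tree that links outside itself would not drag other files along.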

Anyway, I did manage to get the files off, properly owner- and permission-changed, and then reformatted everything on the drive in ext4, with the flag that the Entware installation wiki said was necessary for ext4 support to work on this particular firmware. And then, when properly reformatted and ready to go, the automounting worked again and I was able to dump the files back on to the drive and plug it back in to the router. This time, the drive was recognized and mounted appropriately on a router reboot, so I could do the rest of the Entware installation instructions, to place the right commands in the right places to make sure that the drive would always mount to /opt when the router booted or rebooted, and from there, I finally started to work on the actual scripts that I had also downloaded (using wget on BusyBox first before the file system disaster, because the right tools to clone the GitHub repository weren't present in BusyBox or in Entware. Nor did they have the right tools to use the scp command so that I could just ship the folder over from my laptop over the secure shell session and have it ready. BusyBox is an entirely different environment than the usual distributions, oy.)

First order of business was setting up the configuration files so that they would point appropriately at the MQTT server that already existed and the Home Assistant server so that the MQTT messages and topics would be created appropriately and their data then slung to the endpoints for Home Assistant to ingest. That was easy enough, because Busybox had nano, my preferred console text editor, on it, and so I could put in the IP addresses, ports, and authentication methods and tokens. From there, I uncommented the appropriate lines to get the script combinations working and updating, set the router to run the starter script as a cron job every minute, and then checked Home Assistant to see if the endpoints were being properly created and the data fed to them. Which they were!

Until I rebooted Home Assistant. And then they were "unavailable" and supposedly no longer being provided by MQTT. Which didn't make any sense to me. When I reboot Home Assistant and/or my Pi, the switch I created with MQTT for the bedroom TV apparently works and continues to work fine, but these sensor units don't. So, back to the console on the router I went, to try and figure out what was happening. The scripts were still running on their timing, which was good, but when I re-ran one of the scripts to try and figure out what was going on, the script told me "nothing has changed" as one of the outputs, so the MQTT topics were still in existence, and they were probably still receiving feeds of information from the router and the scripting going on there. The breakage, then, was almost certainly in the REST endpoints not receiving data any more from the MQTT bits.

Luckily for me, I have experience with this particular situation happening, based on the scripting and setup that I did with the Daily Wisdom endpoint, and so I knew that when Home Assistant rebooted, it forgot all of the endpoints that had been temporarily created by things posting to endpoints that it hadn't been listening for. So if I used one of the scripts to completely remove all the entities that had been created by the scripts at the outset (the scripts that gather the data are very nice and output a text file with all of the MQTT topics they have created, which can be passed as a parameter into the removal script to make sure that all of those topics are removed, or only one of those topics is removed, if only one is desired), I could then reset the scripts to a "first-run" state, where they would recreate the MQTT topics and then the data would flow freely once again, until the next Home Assistant restart, which usually happens once a day in the early morning. So I cobbled together a hack by removing the interactive elements of the removal script, saving it with a different name, and then setting another cron job for the router to run the removal script at a specific point in time so that the other cron job will then recreate the topics and the sensors when it runs next. The flow of information will be interrupted for about a minute or two, but then everything will be right as rain again and the sensors will be available.

I say this is a hack, because while I'm very confident it will work, I think it's the wrong way of going about the situation. The MQTT topics and messages are still running fine when Home Assistant reboots itself and forgets the temporary endpoints. What I suspect is the more durable, consistent, and less hacky method of getting things to work is to do the same thing that I did with the bedroom TV switch and define into the configuration files the sensors that the MQTT topics represent, so that when there's a reboot, Home Assistant will go looking for the topics that still exist and incorporate those, rather than relying on temporary endpoint creation from the scripts. With Home Assistant listening in the right place, the messages that are still there will show up immediately after the reboot.

According to the documentation, after the reboot, it's best practice for the MQTT broker to re-send the discovery message so Home Assistant will pick the sensors back up again. Which would suggest the break is actually happening at the machine that's managing the broker, and the correct script to run is one on the broker machine, some kind of "when Home Assistant says 'I'm here and listening!', re-send the discovery messages that are already present in the broker" script. Which would mean enumerating over all the things that are there, reading their current config, and then republishing those things. That sounds like it should be doable, although the file with the enumeration is on the router, not on the broker machine. At this point, I think I don't understand the protocols and the configuration and discovery messages well enough to change the removal script into a republishing script. (What I would do in that situation is probably to amalgamate all the publishing commands from each of the individual scripts, somehow, and then just run them all in a row. Which, no. There's a better provided solution here, just run the removal script once, and then in the next minute, everything will repopulate, and the links will be re-established until the next reboot. Assuming that the next reboot of Home Assistant is the scheduled one, of course.)
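A minimal sketch of the "re-send when Home Assistant says it's listening" idea, under some stated assumptions. As far as I understand Home Assistant's MQTT documentation, the integration publishes a birth message of `online` to `homeassistant/status` by default, and discovery configs live under `homeassistant/<component>/<object_id>/config`. It is whatever publishes the sensors that reacts to the birth message, rather than the broker itself. The sensor names and topics below are hypothetical, and the MQTT client is deliberately left out: `publish` stands in for whatever client library is on hand.

```python
import json

STATUS_TOPIC = "homeassistant/status"  # Home Assistant's default birth topic
ONLINE_PAYLOAD = "online"

# Hypothetical sensors; the real scripts generate their own list.
SENSORS = [
    {"object_id": "router_uptime", "name": "Router Uptime",
     "state_topic": "router/uptime", "unit": "s"},
    {"object_id": "router_cpu_temp", "name": "Router CPU Temperature",
     "state_topic": "router/cpu_temp", "unit": "°C"},
]


def discovery_message(object_id, name, state_topic, unit=None):
    """Build a (topic, payload) pair announcing one sensor to Home Assistant."""
    config = {"name": name, "unique_id": object_id, "state_topic": state_topic}
    if unit is not None:
        config["unit_of_measurement"] = unit
    return f"homeassistant/sensor/{object_id}/config", json.dumps(config)


def on_message(topic, payload, sensors, publish):
    """Re-announce every sensor when Home Assistant reports it is back online.

    `publish` is any callable taking (topic, payload). Returns how many
    announcements were sent.
    """
    if topic != STATUS_TOPIC or payload != ONLINE_PAYLOAD:
        return 0
    for sensor in sensors:
        publish(*discovery_message(**sensor))
    return len(sensors)


# Stand-in for a real client: collect what would have been published.
sent = []
on_message(STATUS_TOPIC, ONLINE_PAYLOAD, SENSORS, lambda t, p: sent.append((t, p)))
print([topic for topic, _ in sent])
```

The other route that documentation describes, as I read it, is publishing the discovery messages with the retain flag set, so the broker hands them to Home Assistant again when it reconnects and nothing has to listen for the birth message at all.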

The use of the second task to clear the entity field and then repopulate it later fell over and sank into the swamp. Still "unavailable" sensors. It should work, right? I pasted in the correct command that worked from the console, but it didn't do what I wanted it to when it was set as a scheduled job. Even though there should have been ample time for everything to happen correctly with those scripts working, we went back to the drawing board. Which involved trying to modify the one script that was working to run the appropriate script when a time condition was met. Unfortunately for me, when I ran that from the terminal, it worked, but when the scheduler ran it, it fell over and sank into the swamp. Much like trying to run the other script had. Whether these situations were all problems because the removal script needed an argument passed to it, and the cron / scheduler wasn't interpreting it correctly, or because cron didn't invoke an actual shell or have permissions to use the tools that were available (or didn't know where those tools were, and I wasn't properly telling either it or the script), or any number of other possible reasons, things just weren't working on the pathway I wanted to have happen.
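One of those guesses, that the scheduler doesn't know where the tools are, is easy to illustrate. Cron-style schedulers often run jobs with a much shorter PATH than an interactive login gets, and Entware's tools live under /opt. This sketch only demonstrates the effect; the sparse PATH value and the tool list are assumptions, not what this firmware or these scripts actually use.

```python
import os
import shutil


def missing_commands(commands, search_path):
    """Return the commands that cannot be found on `search_path`."""
    return [name for name in commands if shutil.which(name, path=search_path) is None]


# Hypothetical: a sparse PATH such as a scheduler might provide.
SPARSE_PATH = "/bin:/usr/bin"
interactive_path = os.environ.get("PATH", "")

# Hypothetical tools a script might call.
needed = ["sh", "mosquitto_pub"]

print("missing under sparse PATH:", missing_commands(needed, SPARSE_PATH))
print("missing interactively:    ", missing_commands(needed, interactive_path))
```

If a tool shows up as missing only under the sparse PATH, calling it by its full path in the scheduled command (or setting PATH at the top of the script) is the usual workaround.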

But then I noticed something that hadn't pinged before. Rereading the documentation of the project I was using, it mentioned that the scripts would compare what they had discovered against a file that got written with everything that had been sent, discovered, or otherwise set up during the running of the other scripts. So I wondered if I could simplify my life much more significantly by just deleting the file with all of the established discoveries and then see if new discovery messages would be sent. I set up the conditions (a reboot of Home Assistant, which I needed to do anyway to clear out some things that weren't real settings any more, and then a manual delete of the file in question from the terminal) and when the scheduler ran the script on the minute, all of my "unavailable" sensors reconnected themselves to Home Assistant! Deleting the file meant that the scripts that were sending data re-sent their discovery messages along with it, as they would on first establishment. Which meant my first solution, of a second scheduled cron job, might work just fine to get everything reconnected. So I set up the scheduler to run one "delete file" command at the first interval past when the Home Assistant reset itself, and then, after a night's sleep to test it, found that it had worked appropriately - the sensors had repopulated after the break in between when Home Assistant rebooted and the scheduled task on the router ran to remove the file that held all the things that had been already populated to the MQTT broker.
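The behaviour described there, where a record file decides whether discovery gets re-sent, can be sketched in a few lines. This is an illustration of the pattern only, not the project's actual code. The file name, topic names, and function name are all invented.

```python
import os
import tempfile


def topics_to_announce(current_topics, state_file):
    """Return the topics that still need a discovery message, then record them."""
    try:
        with open(state_file) as handle:
            known = {line.strip() for line in handle if line.strip()}
    except FileNotFoundError:
        known = set()  # no record at all: behave like a first run
    new_topics = [topic for topic in current_topics if topic not in known]
    if new_topics:
        with open(state_file, "a") as handle:
            for topic in new_topics:
                handle.write(topic + "\n")
    return new_topics


TOPICS = ["router/uptime", "router/cpu_temp"]  # hypothetical topic names

with tempfile.TemporaryDirectory() as workdir:
    record = os.path.join(workdir, "announced.txt")  # hypothetical file name
    first = topics_to_announce(TOPICS, record)   # first run: everything is new
    second = topics_to_announce(TOPICS, record)  # later runs: nothing has changed
    os.remove(record)                            # the scheduled "delete file" step
    third = topics_to_announce(TOPICS, record)   # everything is announced again

print(first, second, third)
```

Seen this way, deleting the file is just forcing the first-run branch, which is why a single scheduled delete after the nightly restart is enough.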

[VICTORY FANFARE GOES HERE, IF A BIT CONFUSED.]

This one definitely goes in the column of "if it works, you've succeeded." There's got to be a better, cleaner, more elegant solution that somehow manages to notice when Home Assistant comes back on-line after the reboot and knows to rebroadcast all the discovery messages that have happened before, so Home Assistant jumps back in to understanding. Or I need to get better knowledge of how the discovery bits are structured so that I can turn them into sensors that know to seek their own discovery after a restart or have it already in the Home Assistant configuration what topics to listen to for their values. Something that's both automated and flexible enough to adapt to the circumstances and to work with the tools that are available to me. Some of the documentation and community posts I've read about this suggest, however, that it is not that simple to collect a list of what topics have been generated so far on an MQTT broker. And if I can't get it to go in BusyBox, then I'd probably have to do something on Entware, and that would still mean writing a script that's specifically listening for something to fire so that it can do something in return. The elegant solution has a significantly larger amount of complexity in it than the simple one, and if I really wanted something that was truly flexible and responsive, I could just set the router to run the command that deletes the file of things that have already been set up at regular intervals, so the topics would be continuously rebroadcast and never more than so many minutes away from coming back online, regardless of when I restarted Home Assistant. That sounds like a lot of unnecessary network traffic, though.

So I've done something else to make my Home Assistant more full of data, using a communication protocol that I've already set up on one machine and a suite of scripts that someone else has already designed for use along with what is essentially an add-on system for the router. And then figured out how to (inelegantly) ensure that the sensor data would continue flowing after a scheduled restart of Home Assistant. Now that I'm on the other side of it, and of the network restructuring that took place before it, I can see how little time this could have taken, had I gone straight to the correct (or at least the working) solution immediately, but a lot of how I get to the solutions I either accept or use until a better solution presents itself is reading docs and fucking around and finding out. Which, through its iterative nature, takes time and frustration and thinking and coming back to a problem after sleeping on it for a bit. And accepting that even if this solution is not the correct one, it does not necessarily mean that there isn't one or that I cannot find the correct one. And sometimes it means research. Not succeeding the first time, or the hundredth time, does not mean I am permanently a useless failure at everything. It only means that I have not succeeded this time at this one thing. (Which can be hard to remember when the weasels are biting.)