DebuggingThread

Jc2k edited this page Oct 9, 2023 · 2 revisions

Debugging Thread

These notes assume you are debugging a Thread problem in the context of Home Assistant. They are focussed on HAP (HomeKit Accessory Protocol) but should be broadly applicable to any Thread-based protocol.

This guide is aimed at devices that you have already connected to Thread but that keep losing their connection.

Preliminary steps

  • Are you using VLANs? Home Assistant must be on the same VLAN as your border routers. All of them must be on that VLAN. This is mandatory. (For Matter, the Matter container must also be on this VLAN). If you need to use a phone as part of the setup for your device, it must also be on this VLAN.

  • Are you using HAOS? You really should. Thread is not well supported on "stock" Linux.

    • Using NetworkManager? It doesn't support multiple border routers. Instead it will rotate which one it is using (potentially disrupting traffic). With more than 3 BR's, you'll see this at least once a minute. If your NetworkManager is older than the one in HAOS, it might have even worse IPv6 deficiencies.
    • Using the stock Linux kernel's handling of router advertisements? This can work, but not out of the box: you need to set sysctls such as net.ipv6.conf.eth0.accept_ra_rt_info_max_plen to 64.
    • Using systemd-networkd? We know it has behaviour that is not consistent with the kernel. It might work better than unpatched NetworkManager. Please let us know how testing that goes.
    • Not using the HAOS kernel patches? If you run OTBR or have IP forwarding configured, Linux might stop the routing logic from paying attention to neighbour discovery data. Why is that a problem? When a route is no longer valid, it can stay in the route table until its TTL (time to live) expires. That is normally 30 minutes. So your network might break for 30 minutes every time your BR changes its link-local IP address (you can't make those static). If the routing logic was consulting the neighbour cache, it would stop using that route in under a minute.
  • Using Apple border routers?

    • Make sure you are running iOS 17 on all your Apple routers. That release has TREL. TREL lets the BR's mesh over WiFi and ethernet as well as Thread. This basically solves mesh partitions.
    • Make sure your BR's are actually on the same network. One of mine learned about the non-IoT VLAN and kept switching between networks, causing carnage.
    • Using a SkyConnect that is connected to your Apple BR's means only erratic support for TREL.
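The stock Linux item above names the accept_ra_rt_info_max_plen sysctl. A minimal sketch of checking and setting it follows. The helper name ra_plen_key is hypothetical, and eth0 is only the example interface from the text; substitute the interface that faces your border routers.

```shell
# Build the sysctl key for a given interface (eth0 is only an example name).
ra_plen_key() {
    printf 'net.ipv6.conf.%s.accept_ra_rt_info_max_plen\n' "$1"
}

# Usage, as root on the Linux host (nothing is changed by defining the helper):
#   sysctl "$(ra_plen_key eth0)"          # show the current value
#   sysctl -w "$(ra_plen_key eth0)=64"    # accept route prefixes up to /64
```

A value set with sysctl -w does not survive a reboot, so it would need to be made persistent separately if it helps.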

Prerequisites for debugging

Find your devices

We'll need to be able to poke at your problematic devices. To do that we need an mDNS tool like this. We are interested in records in the _hap._udp namespace. All your border routers should be visible in _meshcop._udp. (For Matter, _matter._tcp).
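If you prefer the command line, the following is a rough sketch of browsing those service types. It assumes either avahi-browse (common on Linux) or dns-sd (shipped with macOS) is installed; neither tool is named by this guide, and the helper name browse_service is hypothetical.

```shell
# Service types named in this guide.
SERVICE_TYPES="_hap._udp _meshcop._udp _matter._tcp"

# Browse one service type with whichever tool happens to be installed.
browse_service() {
    if command -v avahi-browse >/dev/null 2>&1; then
        avahi-browse -rt "$1"     # -r resolve records, -t exit after the initial dump
    elif command -v dns-sd >/dev/null 2>&1; then
        dns-sd -B "$1" local      # keeps running until interrupted with Ctrl-C
    else
        echo "no mDNS browser found" >&2
        return 1
    fi
}

# Usage: browse_service _hap._udp
```

Run this from a machine on the same VLAN as your border routers, otherwise the mDNS records will not be visible.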

Docker access

If you are running HA Core, you should have this covered. You should be able to SSH in to your Docker host and run "docker ps" and see your Home Assistant Container.

If you are running HAOS, it is a little trickier. You need to follow this guide to set up SSH access on port 22222.

When you've done this you should be able to use ssh on your Mac or Linux desktop (or PuTTY on Windows) to remotely connect to HAOS.

You need to get to the stage where you can run "docker ps" and see a list of containers running on your system.

Shell access

Run docker ps to see a list of containers running on your system.

Use docker exec -it NAME bash to get a shell inside your home assistant container. On HAOS, replace NAME with homeassistant.

Any changes you make in this container will persist until the next upgrade you do (changes in /config will be permanent, of course).

Debug tools

You will need to do this every time you upgrade HA.

apk --no-cache add iproute2 tcpdump
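Because this has to be repeated after every upgrade, it can be convenient to run it from the Docker host without opening a shell first. A small sketch, assuming the container name homeassistant used on HAOS; the function name install_debug_tools is hypothetical.

```shell
# Reinstall the debug tools inside the HA container from the Docker host.
# $1 is the container name; it defaults to "homeassistant" (the HAOS name).
install_debug_tools() {
    docker exec "${1:-homeassistant}" apk --no-cache add iproute2 tcpdump
}

# Usage: install_debug_tools            # HAOS
#        install_debug_tools NAME       # any other container name from "docker ps"
```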

Background

Thread is designed for consumer networks and to be plug and play. This is in tension with most "advanced" users who have "Professional" or "Enterprise" grade home networks.

Your home router is basically not involved in this process. Imagine that your BR's were connected by ethernet. In the ideal case, you'd have a single "dumb" switch which all your BR's are connected to, and HA would also be directly connected to that switch. We might stretch that basic model and have multiple switches and WiFi hotspots, but at its core that is the environment thread expects - a single flat network where everything is configured automatically over IPv6.

Border routers will self configure. In general, they will define their own ULA network on top of whatever network you have. So your computer might pick a link-local address for itself (fe80), get an address from your home router and also get a third address from your border routers.
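To tell those address types apart when reading tool output, a prefix check is usually enough. The sketch below is illustrative only: the function name classify_ipv6 is hypothetical, and it assumes the address is lowercase with its first group written in full (which is how the ip tool prints these ranges).

```shell
# Classify an IPv6 address by its leading bits.
classify_ipv6() {
    case "$1" in
        fe[89ab]?:*) echo "link-local" ;;   # fe80::/10
        f[cd]??:*)   echo "ula" ;;          # fc00::/7, in practice fdxx:...
        [23]???:*)   echo "global" ;;       # 2000::/3, public addresses
        *)           echo "other" ;;
    esac
}

# Usage: classify_ipv6 fd4a:1b2c:3d4e:1::10
```

An address reported as "ula" that you did not configure yourself is most likely the one handed out by your border routers.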

If you have public IPv6 your border routers might ask your router for a range of IPv6 addresses of their own over DHCPv6. When this happens your Thread devices will get public IPv6 addresses in that range. Depending on the firewall configuration for your router, you might find they are reachable from the internet. But they will almost certainly be able to make outbound connections for themselves.

If you don't have public IPv6, your devices might still be able to make outbound network connections for themselves using NAT64 (https://openthread.io/codelabs/openthread-border-router-nat64).

It's important to understand that the SkyConnect acts more like a Switch than a Router.

Capturing and understanding network state

Route tables

Use ip -6 route inside your HA container to see your route table.
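The routes of interest are usually the ones whose next hop is a link-local (fe80) address, since that is how a border router appears as a gateway. A minimal filter, with a hypothetical function name, might look like this:

```shell
# Read "ip -6 route" output on stdin and print "destination next-hop"
# for every route that goes via a link-local gateway.
routes_via_link_local() {
    awk '$2 == "via" && $3 ~ /^fe80:/ { print $1, $3 }'
}

# Usage (inside the HA container): ip -6 route | routes_via_link_local
```

Several next hops for the same destination typically mean several border routers are advertising that prefix.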

Neighbour cache

Use ip -6 neigh inside your HA container to see your neighbour cache.
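When a route's next hop has gone away, its neighbour entry tends to sit in the FAILED or INCOMPLETE state. The sketch below pulls those entries out; the function name is hypothetical, and it assumes the state is the last field on each line, as in the usual ip output.

```shell
# Read "ip -6 neigh" output on stdin and print the addresses of
# neighbours that could not be resolved.
unreachable_neighbours() {
    awk '$NF == "FAILED" || $NF == "INCOMPLETE" { print $1 }'
}

# Usage (inside the HA container): ip -6 neigh | unreachable_neighbours
```

If an address printed here still shows up as a "via" gateway in ip -6 route, traffic for that route is probably being dropped.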

Address assignments

Use ip -6 a inside your HA container to see the IPv6 addresses assigned to your interfaces.
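The full output is fairly noisy. As a convenience, the one-line format (the -o option of ip) can be reduced to just interface and address. The function name here is hypothetical.

```shell
# Read "ip -6 -o addr show" output on stdin and print "interface address/prefix".
list_ipv6_addrs() {
    awk '$3 == "inet6" { print $2, $4 }'
}

# Usage (inside the HA container): ip -6 -o addr show | list_ipv6_addrs
```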