Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incusd crashes when a message is written to console websocket #1449

Open
2 tasks done
tsuzu opened this issue Dec 2, 2024 · 1 comment
Open
2 tasks done

incusd crashes when a message is written to console websocket #1449

tsuzu opened this issue Dec 2, 2024 · 1 comment
Labels
Bug Confirmed to be a bug
Milestone

Comments

@tsuzu
Copy link

tsuzu commented Dec 2, 2024

Required information

  • Distribution: Flatcar Linux 3941.0.0
  • Distribution version: 6.7
  • The output of "incus info" or if that fails:
incus info
config:
  cluster.https_address: <omitted>:8443
  core.https_address: :8443
  oidc.audience: <omitted>
  oidc.client.id: <omitted>
  oidc.issuer: <omitted>
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
- resources_cpu_flags
- disk_io_bus_cache_filesystem
- instance_oci
- clustering_groups_config
- instances_lxcfs_per_instance
- clustering_groups_vm_cpu_definition
- disk_volume_subpath
- projects_limits_disk_pool
- network_ovn_isolated
- qemu_raw_qmp
- network_load_balancer_health_check
- oidc_scopes
- network_integrations_peer_name
- qemu_scriptlet
- instance_auto_restart
- storage_lvm_metadatasize
- ovn_nic_promiscuous
- ovn_nic_ip_address_none
- instances_state_os_info
- network_load_balancer_state
- instance_nic_macvlan_mode
- storage_lvm_cluster_create
- network_ovn_external_interfaces
- instances_scriptlet_get_instances_count
- cluster_rebalance
- custom_volume_refresh_exclude_older_snapshots
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
- oidc
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - <omitted>
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB/zCCAYagAwIBAgIQWRCbU44hrisQmKspUyDV8jAKBggqhkjOPQQDAzAyMRkw
    FwYDVQQKExBMaW51eCBDb250YWluZXJzMRUwEwYDVQQDDAxyb290QHBhcHJpa2Ew
    HhcNMjQxMTI1MTg0ODUzWhcNMzQxMTIzMTg0ODUzWjAyMRkwFwYDVQQKExBMaW51
    eCBDb250YWluZXJzMRUwEwYDVQQDDAxyb290QHBhcHJpa2EwdjAQBgcqhkjOPQIB
    BgUrgQQAIgNiAASZI1i282jvdDzNX9aYzAw+cgPMVjH17A6a0VB/QQWaxn31W67h
    ee5CEkrGyFnyi6aiJJLuvpkJ39XWQgr5qEorIb9MgTsAmvJKTnBuZIMv4JQeChEi
    IGS/BbBJCCdlchWjYTBfMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEF
    BQcDATAMBgNVHRMBAf8EAjAAMCoGA1UdEQQjMCGCB3BhcHJpa2GHBH8AAAGHEAAA
    AAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDZwAwZAIwI/dyfbAzC8BrHmApElAt
    jLaZQoOQtrkfNQc7Njon+H6unHsjp8MO1hRDz/omhBE6AjA8XASjj6LD0hzMA4Pa
    ou1D30Yt9GrlVbt4bBbSdmKjqkjNzuCOzJC5fEQmrVNY7qU=
    -----END CERTIFICATE-----
  certificate_fingerprint: d87686e669cb458392f13f57e74da9f5c40d95b945514d61445daf4c18629d40
  driver: lxc | qemu
  driver_version: 6.0.2 | 9.0.3
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 6.6.21-flatcar
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Debian GNU/Linux
  os_version: "12"
  project: default
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: paprika
  server_pid: 215824
  server_version: "6.7"
  storage: lvm
  storage_version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: false
  - name: lvmcluster
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: true
  - name: btrfs
    version: "6.2"
    remote: false

Issue description

When I opened Text console of a VM/Container on Web UI and then moved to another pane like Configuration, incusd crashed with the following output:

panic: concurrent write to websocket connection

goroutine 12380 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc00087de68, 0x1, {0x373a6c0?, 0xc00087de01?, 0xc00087de98?})
        /root/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:617 +0x4b8
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc0004f1340, 0x1fa654c?, {0x373a6c0, 0x0, 0x0})
        /root/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:770 +0x126
github.com/lxc/incus/v6/shared/ws.(*wrapper).Close(0xc0010620c0)
        /build/incus/shared/ws/rwc.go:92 +0x17f
github.com/lxc/incus/v6/shared/ws.MirrorRead.func1()
        /build/incus/shared/ws/mirror.go:38 +0x21d
created by github.com/lxc/incus/v6/shared/ws.MirrorRead in goroutine 12377
        /build/incus/shared/ws/mirror.go:32 +0x1e5

or

panic: concurrent write to websocket connection

goroutine 2786 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc000871d48, 0x1, {0xc001957116?, 0x0?, 0xc000119508?})
        /root/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:617 +0x4b8
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc0019109a0, 0x0?, {0xc001957116, 0x2, 0x2})
        /root/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:770 +0x126
main.(*consoleWs).doConsole.func4()
        /build/incus/cmd/incusd/instance_console.go:324 +0x49
main.(*consoleWs).doConsole(0xc00104e1e0, 0x0?)
        /build/incus/cmd/incusd/instance_console.go:345 +0x4c2
main.(*consoleWs).Do(0xc00104e1e0, 0xc001178780)
        /build/incus/cmd/incusd/instance_console.go:202 +0x89
github.com/lxc/incus/v6/internal/server/operations.(*Operation).Start.func1(0xc001178780)
        /build/incus/internal/server/operations/operations.go:290 +0x26
created by github.com/lxc/incus/v6/internal/server/operations.(*Operation).Start in goroutine 2602
        /build/incus/internal/server/operations/operations.go:289 +0x105

Steps to reproduce

I can reproduce the problem by

  1. Open the incus web UI
  2. Open Text console of a container
  3. Move to another pane like Configuration
  4. Go back to Step 2

After some tries, incusd will crash.
However, I'm still looking for a way to reproduce this with CLI or a Go program.

Information to attach

  • [ ] Any relevant kernel output (dmesg)
  • [ ] Container log (incus info NAME --show-log)
  • [ ] Container configuration (incus config show NAME --expanded)
  • Main daemon log (at /var/log/incus/incusd.log)
    • attached above
  • [ ] Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)
INFO   [2024-12-02T16:09:27Z] Action: instance-console, Source: /1.0/instances/testvm, Requestor: oidc/github|8574909 (172.71.24.104)  type=console
DEBUG  [2024-12-02T16:09:27Z] [paprika] Started mirroring websocket         address="172.71.24.99:59566"
DEBUG  [2024-12-02T16:09:27Z] [paprika] Websocket: Started read mirror      address="172.71.24.99:59566"
DEBUG  [2024-12-02T16:09:27Z] [paprika] Websocket: Started write mirror     address="172.71.24.99:59566"
INFO   [2024-12-02T16:09:27Z] Action: instance-console, Source: /1.0/instances/testvm, Requestor: oidc/github|8574909 (172.71.24.104)  type=console
DEBUG  [2024-12-02T16:09:27Z] [paprika] Handling API request                ip="162.158.119.126:25678" method=GET protocol=oidc url="/1.0/operations/6490c63e-db50-4f2c-909d-86b6285a363c/websocket?secret=044caddb2b7fae88fb9e9a96e96f94c7adbe8844d972bafd11910b1774d2588f" username="github|8574909"
DEBUG  [2024-12-02T16:09:27Z] [paprika] Connecting to operation             class=websocket description="Showing console" operation=6490c63e-db50-4f2c-909d-86b6285a363c project=default
DEBUG  [2024-12-02T16:09:27Z] [paprika] Connected to operation              class=websocket description="Showing console" operation=6490c63e-db50-4f2c-909d-86b6285a363c project=default
DEBUG  [2024-12-02T16:09:29Z] [paprika] Websocket: Stopped read mirror      address="172.71.24.99:59566" err="read unix:@->/run/incus/testvm/qemu.console: file already closed"
DEBUG  [2024-12-02T16:09:29Z] [paprika] Got error getting next reader: websocket: close 1005 (no status)
DEBUG  [2024-12-02T16:09:29Z] [paprika] Finished mirroring console to websocket  address="172.71.24.99:59566"
DEBUG  [2024-12-02T16:09:29Z] [paprika] Websocket: Sending barrier message  address="172.71.24.99:59566"
DEBUG  [2024-12-02T16:09:29Z] [paprika] Console control websocket finished
Error: websocket: close 1006 (abnormal closure): unexpected EOF
@tsuzu
Copy link
Author

tsuzu commented Dec 2, 2024

I found a potentical cause.
In doConsole, ws.Mirror for conns[0] is running on another goroutine.

// Mirror the console and websocket.
mirrorDoneCh := make(chan struct{})
go func() {
s.connsLock.Lock()
conn := s.conns[0]
s.connsLock.Unlock()
l := logger.AddContext(logger.Ctx{"address": conn.RemoteAddr().String()})
defer l.Debug("Finished mirroring websocket to console")
l.Debug("Started mirroring websocket")
readDone, writeDone := ws.Mirror(conn, console)
<-readDone
l.Debug("Finished mirroring console to websocket")
<-writeDone
close(mirrorDoneCh)
}()
// Wait until either the console or the websocket is done.
select {
case <-mirrorDoneCh:
close(consoleDisconnectCh)
case <-consoleDoneCh:
close(consoleDisconnectCh)
}
// Get the console and control websockets.
s.connsLock.Lock()
consoleConn := s.conns[0]
ctrlConn := s.conns[-1]
s.connsLock.Unlock()
defer func() {
_ = consoleConn.WriteMessage(websocket.BinaryMessage, []byte("\n\r"))
_ = consoleConn.Close()
_ = ctrlConn.Close()
}()

When consoleDoneCh is closed before conns[0] is closed, conns[0].WriteMessage can be potentially called from two goroutines:
doConsole goroutine

_ = consoleConn.WriteMessage(websocket.BinaryMessage, []byte("\n\r"))

and ws.Mirror goroutine

return w.conn.WriteMessage(websocket.TextMessage, []byte{})

so can't we remove the first WriteMessage, which is called without mutex?

@stgraber stgraber added the Bug Confirmed to be a bug label Dec 2, 2024
@stgraber stgraber added this to the incus-6.8 milestone Dec 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Confirmed to be a bug
Development

No branches or pull requests

2 participants