Troubleshooting RTR/textpass Restarts, Hangs, or Segfaults Triggered by Near-Simultaneous TCAP-Aborts

Overview

RTR/textpass may become unresponsive for several seconds and then restart (watchdog abort), or in some cases crash with a segfault. In the investigated scenario, the most likely trigger pattern was a burst of multiple TCAP-Abort messages arriving nearly simultaneously (tens of milliseconds apart), associated with HLR overload/congestion signaling (map-userAbort → resourceUnavailable → shortTermResourceLimitation from SSN 6 to SSN 8).

This behavior was treated as an unconfirmed bug with race-condition-like characteristics in the TCAP abort-handling path. Because the affected environment was running an older, unsupported NewNet/Skyvera release, no engineering code fix was available for that legacy version. Upgrading to a supported release is required before engineering can provide debug/hot patches, and a potential final fix, if the issue reproduces after the upgrade.

Solution

Symptoms

You may observe one or more of the following exact log messages.

Watchdog / health monitoring messages

textpass: Missing health signal, missed 2 signals, allowed 5
textpass: Missing health signal, missed 3 signals, allowed 5
textpass: Missing health signal, missed 4 signals, allowed 5
textpass: Missing health signal, missed 5 signals, allowed 5
textpass: Missing health signal, missed 6 signals, allowed 5
textpass: Missed 6 health signals: trying to abort process (pid=<pid>)

Related TCAP-abort/timeouts

textpass: SRI-SM request received TCAP Abort or timed out.

Crash signature (variant symptom)

kernel: textpass[<pid>]: segfault at <addr> ip <addr> sp <addr> error 14
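
As a quick triage aid, the messages above can be matched with a few regular expressions. The following is a minimal sketch rather than a vendor tool: the function names and category labels are hypothetical, and it assumes the message text appears in each syslog line exactly as listed above, with any timestamp or host prefix in front of it.

```python
import re

# Patterns mirror the messages listed above. The category names are
# illustrative labels, not identifiers used by textpass itself.
SYMPTOM_PATTERNS = {
    "watchdog_missed": re.compile(
        r"textpass: Missing health signal, missed (\d+) signals, allowed (\d+)"
    ),
    "watchdog_abort": re.compile(
        r"textpass: Missed (\d+) health signals: trying to abort process \(pid=(\d+)\)"
    ),
    "tcap_abort_or_timeout": re.compile(
        r"textpass: SRI-SM request received TCAP Abort or timed out\."
    ),
    "segfault": re.compile(
        r"kernel: textpass\[(\d+)\]: segfault at \S+ ip \S+ sp \S+ error (\d+)"
    ),
}


def classify_line(line):
    """Return the symptom category for one log line, or None if it matches none."""
    for name, pattern in SYMPTOM_PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def count_symptoms(lines):
    """Count how many lines fall into each symptom category."""
    counts = {name: 0 for name in SYMPTOM_PATTERNS}
    for line in lines:
        name = classify_line(line)
        if name is not None:
            counts[name] += 1
    return counts
```

A non-zero count for both watchdog categories, or for the segfault category, close to a TCAP-abort/timeout line is a reason to continue with the PCAP correlation described later in this article.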

Probable Trigger Pattern

Log and packet-capture analysis identified a consistent trigger pattern:

  • Multiple TCAP-Abort messages arriving very close together (example observed: two aborts ~54 ms apart).
  • The aborts matched HLR-side congestion/overload characteristics:
    • MAP dialogue: map-userAbort → resourceUnavailable → shortTermResourceLimitation
    • Source SSN: 6 (HLR)
    • Destination SSN: 8 (RTR acting as MSC/SMSC)
  • The failure may present as a several-second hang/unresponsiveness followed by watchdog termination/restart, or as a segfault, depending on timing (consistent with a race-condition-like defect in the TCAP abort-handling path).

Packet capture review confirmed SIGTRAN (SCTP/M3UA) carrying SCCP/TCAP/MAP traffic, including MAP sendRoutingInfoForSM, and TCAP-Abort messages near the event window.
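
To make "very close together" concrete, the sketch below groups TCAP-Abort arrival times into bursts. It assumes the abort timestamps (in seconds) have already been exported from the capture with whatever tool you use. The function name and the 100 ms default threshold are illustrative assumptions, chosen only to cover "tens of milliseconds"; they are not values defined by the product.

```python
def find_abort_bursts(timestamps, max_gap_s=0.100):
    """Group TCAP-Abort arrival times (seconds) into near-simultaneous bursts.

    A burst is two or more aborts in which each abort arrives within
    max_gap_s of the previous one. The 100 ms default is an assumption.
    """
    bursts = []
    current = []
    for ts in sorted(timestamps):
        # A gap larger than the threshold closes the current group.
        if current and ts - current[-1] > max_gap_s:
            if len(current) >= 2:
                bursts.append(current)
            current = []
        current.append(ts)
    if len(current) >= 2:
        bursts.append(current)
    return bursts


# Two aborts 54 ms apart (as in the observed example) plus one isolated abort.
print(find_abort_bursts([10.000, 10.054, 12.500]))
```

An isolated abort is not reported; only groups of two or more aborts within the threshold are returned.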

How to Confirm It Matches This Issue

  1. Confirm watchdog restart behavior in syslog

    Look for the sequence starting with:

    • "textpass: Missing health signal, missed 2 signals, allowed 5"

    and progressing to:

    • "textpass: Missed 6 health signals: trying to abort process (pid=<pid>)"
  2. Correlate timing

    • Health is checked roughly every 1 second.
    • If the first observed log line is "missed 2 signals" at <timestamp_T>, then underlying unresponsiveness likely began at or before approximately <timestamp_T - 2 seconds>.
  3. Validate the signaling trigger in a PCAP

    In the packet-capture window preceding the hang/restart, confirm:

    • Presence of TCAP-Abort messages.
    • Inter-arrival times in the tens of milliseconds (near-simultaneous burst).
    • Abort cause pattern consistent with resourceUnavailable / shortTermResourceLimitation.

No additional logs or packet captures are required to support the “probable trigger” conclusion once the above correlation is established.
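
The timing correlation above can be sketched as a small calculation. This is a hedged illustration, not part of the product: it assumes log entries have already been parsed into (datetime, message) pairs in chronological order, and the function name and the 1-second default interval are assumptions based on the "roughly every 1 second" health check described above.

```python
import re
from datetime import datetime, timedelta

MISSED_RE = re.compile(r"Missing health signal, missed (\d+) signals, allowed (\d+)")
ABORT_RE = re.compile(r"Missed (\d+) health signals: trying to abort process")


def estimate_hang_onset(entries, check_interval_s=1.0):
    """Estimate the latest time at which unresponsiveness began.

    entries: iterable of (datetime, message) pairs in chronological order.
    Returns (latest_onset, abort_seen), or None if no "missed N signals"
    line is present. The real onset is at or before latest_onset.
    """
    onset = None
    abort_seen = False
    for when, message in entries:
        missed = MISSED_RE.search(message)
        if missed and onset is None:
            # First "missed N signals" line at T implies onset <= T - N * interval.
            count = int(missed.group(1))
            onset = when - timedelta(seconds=count * check_interval_s)
        if onset is not None and ABORT_RE.search(message):
            abort_seen = True
    if onset is None:
        return None
    return onset, abort_seen


entries = [
    (datetime(2024, 1, 1, 12, 0, 10), "textpass: Missing health signal, missed 2 signals, allowed 5"),
    (datetime(2024, 1, 1, 12, 0, 14), "textpass: Missed 6 health signals: trying to abort process (pid=4321)"),
]
print(estimate_hang_onset(entries))
```

The returned onset marks the end of the window to inspect in the PCAP; TCAP-Abort bursts shortly before that time are the ones of interest. The dates and pid in the usage example are made up for illustration.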

Mitigation Options

Option A: Reduce HLR overload/congestion that drives TCAP-Abort bursts

The observed abort pattern indicates short-term resource limitation from the HLR side. Reducing conditions that cause bursty TCAP-Aborts can reduce triggers for instability. This can include reviewing HLR-side load and congestion/resource thresholds (implementation specifics depend on your network).

Option B (MAP-IWK/RSDS use case): Return a MAP-layer error instead of a TCAP-Abort

For the specific RSDS/MAP-IWK context, a targeted mitigation is to change the behavior so that a MAP-layer error (for example, “Teleservice Not Provisioned”) is returned instead of a TCAP-Abort.

Why this helps: The instability was associated with the TCAP abort-handling code path under near-simultaneous abort bursts. Using a MAP-layer error avoids executing that TCAP-Abort handling path, which is expected to reduce the likelihood of the observed failure mode.

Important considerations:

  • This changes signaling semantics (the meaning of the error changes).
  • Retry/discard behaviors may differ depending on downstream systems.
  • In the RSDS transmission failure scenario described, RSDS is not retransmitted regardless of which error is returned. The impact on message delivery was therefore assessed as negligible for that specific use case, and the change is expected to prevent the observed failure mode.

Verification Steps

After applying a mitigation (Option A or Option B), verify the following:

  1. Process stability

    Confirm RTR/textpass remains responsive and does not repeat the watchdog sequence:

    • "Missing health signal..." lines
    • watchdog abort/restart entries
  2. If using Option B

    Confirm via PCAP that, for the relevant exchange, TCAP-Aborts are no longer being generated and that a MAP-layer error is returned instead (as configured).

  3. Operational KPIs

    Confirm operational KPIs remain acceptable for the impacted flow, especially where the workaround changes error semantics.
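
The process-stability check can be sketched as a filter over log entries recorded after the mitigation was applied. As before, this assumes entries are already parsed into (datetime, message) pairs; the function name, the marker list, and the example timestamps are illustrative assumptions.

```python
from datetime import datetime

# Substrings taken from the symptom messages listed earlier in this article.
RECURRENCE_MARKERS = (
    "Missing health signal",
    "trying to abort process",
    "segfault at",
)


def recurrences_after(entries, mitigation_time):
    """Return textpass watchdog or segfault entries logged at or after mitigation_time."""
    return [
        (when, message)
        for when, message in entries
        if when >= mitigation_time
        and "textpass" in message
        and any(marker in message for marker in RECURRENCE_MARKERS)
    ]


entries = [
    (datetime(2024, 1, 1, 9, 0, 0), "textpass: Missing health signal, missed 2 signals, allowed 5"),
    (datetime(2024, 1, 2, 9, 0, 0), "textpass: SRI-SM request received TCAP Abort or timed out."),
]
# An empty list means no watchdog or segfault entries since the change.
print(recurrences_after(entries, datetime(2024, 1, 1, 18, 0, 0)))
```

An empty result over a representative traffic period supports, but does not prove, that the mitigation is effective, since the trigger depends on abort timing.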

Supported Release Requirement (Long-Term)

This issue was treated as an unconfirmed bug and was not reproduced in lab conditions. If a permanent fix is required, operate on a currently supported NewNet/Skyvera (Lithium) release so that engineering-level diagnostics and debug/hot patches are possible if the issue reproduces.

A practical approach is a controlled validation upgrade (for example, upgrade a single test or non-production node) and monitor for recurrence under real traffic.

Frequently Asked Questions

1. How do I know this is the same issue and not a different restart?
Look for the exact messages "textpass: Missing health signal, missed 2 signals, allowed 5" followed by "textpass: Missed 6 health signals: trying to abort process (pid=<pid>)", and correlate with a PCAP showing multiple TCAP-Aborts arriving within tens of milliseconds, commonly with map-userAbort → resourceUnavailable → shortTermResourceLimitation.
2. Why does the restart appear several seconds after the TCAP-Aborts?
The textpass process can become unresponsive first. The watchdog detects missed health signals at ~1-second intervals (for example, “missed 2 signals”), and only after enough misses does it terminate/restart the process. This makes the visible restart lag behind the initial trigger.
3. Sometimes we see a watchdog restart, and other times we see a segfault. Are these related?
Yes. In the observed behavior, the same underlying trigger pattern (near-simultaneous TCAP-Abort bursts) can present as a hang/unresponsiveness followed by watchdog termination, or as a segfault, depending on timing. Both outcomes are consistent with a race-condition-like defect in the TCAP abort-handling path.
  1. Priyanka Bhotika
