Go Code Review Request

Verdict from the bridge: the core NATS protocol path is battle-hardened, but the MQTT and internode/logging flanks have real crash-class ordnance lying around. This is not “lol Go bad.” It is mostly mature systems code. But a few places still trust length fields or environmental I/O like a reckless intern with a plasma torch.

I could not run the full suite because this tree requires Go 1.25.x (go.mod says Go 1.25/toolchain 1.25.11) and the container only had Go 1.23.2 with no network to fetch the newer toolchain. GOTOOLCHAIN=local go test ./... fails before tests start. I did run what could be run locally without external deps/toolchain upgrades: conf, internal/fastrand, internal/ldap, server/ats, server/avl, server/thw passed. logger tripped over unavailable local syslog. The rest below is static review plus small focused repro checks.

I am not going to print 128 production files’ worth of “nothing found.” That is how weak empires die in paperwork. Below is the file-by-file list of crash-relevant findings, then the files/subsystems I checked where the armor looked sane.

Highest-priority findings

File(s)	Severity	Finding
`server/mqtt.go`	Critical	Malformed MQTT `PUBLISH` can drive `pp.sz` negative, then panic in `make([]byte, 0, size)`.
`server/mqtt.go`	High	MQTT partial-packet buffering can grow/copy up to MQTT’s 268 MiB remaining-length ceiling before `max_payload` enforcement. Also O(n²) copying.
`server/route.go`, `server/gateway.go`, `server/parser.go`	High	Route/gateway peers can declare ~1 GB payloads and force huge allocation before any route/gateway max-payload check.
`logger/log.go`	High	Log rotation panics the whole process if reopen fails; `os.Rename` errors are ignored.
`logger/syslog.go`, `logger/syslog_windows.go`, `server/log.go`	Medium/High	`Fatalf` behavior depends on logger implementation: std logger exits, Windows syslog panics, Unix syslog merely logs and returns. Startup can “fatally” fail and then continue.
`conf/parse.go`, `server/opts.go`, `server/server.go`	Medium	Integer suffix overflow silently wraps; negative limit values survive validation and can brick service behavior.
`server/certstore/certstore_windows.go`	Medium	Windows cert-store path has `init()` panics and unsafe `&slice[0]` calls on buffers that are not checked for zero length.
`server/memstore.go`	Medium	Retention-limit enforcement contains a process-wide panic on internal memstore inconsistency. Not proven externally reachable, but the blast radius is wrong.

`server/mqtt.go`: malformed `PUBLISH` can panic the process

This is the ugliest one.

mqttParse() reads MQTT packet remaining length at lines 796-800, then dispatches the packet handlers over the same stream-wide buffer. It does not create a bounded sub-reader for the packet body.

The vulnerable chain:

mqttParse() reads pl with r.readPacketLen() and only checks that the declared packet is “complete” somewhere in the current buffer: server/mqtt.go:796-800.
mqttParsePub() starts at server/mqtt.go:4062.
It reads the topic using r.readBytes("topic", false) at server/mqtt.go:4076, but readBytes() bounds against the whole read buffer, not the MQTT packet’s declared end.
Then it computes:

Go
pp.sz = pl - (r.pos - start)

at server/mqtt.go:4127.

That can go negative if the topic field consumes bytes beyond the declared MQTT remaining length. The code never rejects negative pp.sz.

Then:

mqttProcessPub() computes max-payload using mqttComputeNatsMsgSize() at server/mqtt.go:4268-4271.
Negative total size is not > maxPayload, so it passes.
mqttInitiateMsgDelivery() calls mqttNewDeliverableMessage() at server/mqtt.go:4311-4312.
mqttNewDeliverableMessage() computes size := mqttComputeNatsMsgSize(...) at server/mqtt.go:4202.
It then does:

Go
buf := bytes.NewBuffer(make([]byte, 0, size))

at server/mqtt.go:4204.

If size < 0, Go panics. A panic in any unrecovered goroutine takes down the whole process. The galactic tribunal finds this displeasing.

A minimal packet shape after a valid MQTT CONNECT is enough:


PUBLISH fixed header:
  0x30        // QoS0 PUBLISH
  0x01        // Remaining Length = 1

Variable header bytes that the parser wrongly consumes anyway:
  0x00 0x20   // topic length = 32
  "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

For that frame, pl == 1, the variable header consumed is 2 + 32, so pp.sz == -33. The computed NATS message size is roughly 25 - 33 == -8, and make([]byte, 0, -8) detonates.

The existing tests around malformed MQTT remaining length cover fixed-size packets such as PINGREQ, PUBACK, PUBREC, PUBREL, and PUBCOMP in server/mqtt_test.go:1842-1902, but I did not see the equivalent malformed PUBLISH case. That is precisely where the blade is missing from the scabbard.

Fix: stop parsing MQTT packet bodies against the whole stream buffer. After reading pl, compute packet_end := r.pos + pl, pass only r.buf[r.pos:packet_end] to a bounded packet reader, require exact consumption, then set r.pos = packet_end. As a belt-and-suspenders guard, mqttParsePub() should reject pp.sz < 0. The boundary fix is the real fix; the negative check is the “do not summon demons” rune.

`server/mqtt.go`: MQTT partial-packet buffering can memory/GC-DoS before `max_payload`

The second MQTT wound is less elegant but more brutish.

mqttReader.reset() prepends saved partial packet bytes to the next read:

Go
tmp := make([]byte, l+len(buf))
copy(tmp, r.pbuf)
copy(tmp[l:], buf)

at server/mqtt.go:5818-5824.

readPacketLenWithCheck() saves the entire partial packet when the remaining length says the packet is incomplete:

Go
r.pbuf = make([]byte, len(r.buf)-r.pstart)
copy(r.pbuf, r.buf[r.pstart:])

at server/mqtt.go:5871-5872.

For CONNECT, PUBLISH, SUBSCRIBE, and UNSUBSCRIBE, mqttCheckRemainingLength() returns nil without enforcing a server-side max at server/mqtt.go:986-997. The MQTT remaining-length ceiling is 0x0fffffff / 268,435,455 bytes; the code even names that as mqttMaxPayloadSize at server/mqtt.go:94-95.

The server’s normal max_payload check for MQTT PUBLISH happens much later in mqttProcessPub() at server/mqtt.go:4268-4271, after the complete packet has already been buffered and parsed. For CONNECT, there is no comparable max_payload check.

This creates two problems:

Memory DoS: a client can declare a huge MQTT remaining length and drip bytes until pbuf grows very large.
O(n²) copy storm: every subsequent read allocates and copies the entire accumulated pbuf plus the new chunk. A 268 MiB partial packet accumulated in ~64 KiB chunks implies hundreds of GiB of aggregate copying before the server even gets to say “no.”

Default auth timeouts mitigate pre-auth cases only when MQTT auth is required. They do not solve authenticated MQTT clients or unauthenticated MQTT listeners.

Fix: enforce a transport-level maximum immediately after decoding the MQTT remaining length, before saving/accumulating partial bytes. For PUBLISH, the limit should account for NATS header overhead and max_payload. For non-publish control packets, pick a sane small cap; do not let CONNECT claim 268 MiB because the spec permits a varint that large. Also replace prepend-copy buffering with an append-style buffer that grows amortized O(n), but only after the limit check.

`server/route.go`, `server/gateway.go`, `server/parser.go`: route/gateway peers can force huge allocations

The normal client path is sane here. processPub() and processHeaderPub() enforce c.mpay before payload buffering:

server/client.go:2857-2872
server/client.go:2921-2925

Leaf nodes also enforce max payload before buffering:

server/leafnode.go:3172-3175
server/leafnode.go:3246-3249

Routes and gateways are different.

Route clients are created with mpay: -1:

Go
c := &client{..., kind: ROUTER, msubs: -1, mpay: -1, ...}

at server/route.go:1939.

Gateway clients are created without a max payload:

Go
c := &client{srv: s, nc: conn, start: now, last: now, kind: GATEWAY}

at server/gateway.go:810.

The routed message argument parsers read c.pa.size but do not enforce local MaxPayload:

processRoutedOriginClusterMsgArgs(): server/route.go:182-276
processRoutedHeaderMsgArgs(): server/route.go:280-372
processRoutedMsgArgs(): server/route.go:376-453

Then the generic parser allocates a split-message buffer with capacity derived directly from c.pa.size:

Go
c.msgBuf = make([]byte, lrem, c.pa.size+LEN_CR_LF)

at server/parser.go:1224.

parseSize() accepts up to 9 decimal digits, so a malicious/misconfigured route or gateway peer can declare 999999999 and force roughly a 1 GB backing array even if it only sends one byte of payload. parseSize() is at server/util.go:81-89.

This is probably not a public-internet unauthenticated client bug in a sane deployment; route and gateway ports are normally internal and authenticated/TLS-protected. But “trusted internode peer can OOM me with a length field” is still not a design anyone should defend before the imperial crash council.

Fix: route/gateway inbound payloads need a real receive limit. At minimum, enforce local opts.MaxPayload in all processRouted*MsgArgs() functions before the parser reaches make. Better: negotiate/use min(local MaxPayload, remote INFO MaxPayload) and reject peers that exceed it. The server already advertises MaxPayload in route/gateway INFO (server/route.go:2736, server/gateway.go:546); use the blasted thing.

`logger/log.go`: log rotation can panic the server

fileLogger.Write() handles rotation at logger/log.go:240-281.

If the file crosses the rotation limit, it closes the current file, renames it, then opens a new one:

Go
f, err := os.OpenFile(fname, fileflags, defaultLogPerms)
if err != nil {
	l.Unlock()
	panic(fmt.Sprintf("Unable to re-open the logfile %q after rotation: %v", fname, err))
}

at logger/log.go:266-269.

That means an environmental I/O failure during logging—permissions, deleted directory, filesystem error, bad mount state—can kill the entire NATS process. The server should not be sacrificed to the log-file goblin.

Also:

Go
os.Rename(fname, bak)

at logger/log.go:264 ignores the error. If rename fails but reopen succeeds, rotation state is now fiction: the log file may continue growing while l.out gets reset as if rotation worked.

NewFileLogger() also uses log.Fatalf() on open failure at logger/log.go:106-109. That is defensible for a CLI binary, but hostile if the package is ever used as a library. Less critical than the runtime rotation panic.

Fix: no panics in Write(). Handle rename/open errors as logged errors, switch to stderr/fallback, disable rotation temporarily, or keep the previous descriptor if it is still valid. Since log.Logger.Printf() does not propagate write errors to callers in a useful way, the file logger needs its own error state/fallback path.

`logger/syslog.go`, `logger/syslog_windows.go`, `server/log.go`: `Fatalf` is semantically cursed

The Logger interface includes:

Go
Fatalf(format string, v ...any)

at server/log.go:35-36.

Server.Fatalf() delegates to the configured logger:

Go
logger.Fatalf(format, v...)

at server/log.go:260-262.

But implementations disagree violently:

logger/log.go:389-390: std logger uses log.Fatalf, which logs and exits.
logger/syslog.go:110-113: Unix syslog writes Crit and returns.
logger/syslog_windows.go:89-92: Windows event logger writes error and panics.

So s.Fatalf() may exit, may panic, or may just shrug and return, depending on logger backend. That is not a logging abstraction; that is a random-number generator wearing a uniform.

This has visible consequences. Server.Start() has defer s.Noticef("Server is ready") at server/server.go:2274. If a startup fatal occurs under Unix syslog, s.Fatalf() can return, the function returns, and the defer can still log “Server is ready” after a fatal startup failure. Startup fatal call sites include pidfile/system-account/monitoring/resolver/JetStream/listener failures around server/server.go:2348, 2356, 2367, 2374, 2432, 2778, and 2808.

There is also a small variadic bug:

Go
s.Errorf(format, v)

at server/log.go:257 should be:

Go
s.Errorf(format, v...)

That path only runs during shutdown, but it is still a loose wire.

Fix: fatality should belong to Server, not to log backends. Have Server.Fatalf() log through a non-fatal logging method, then perform one consistent policy: os.Exit(1) for the binary, or return/store a fatal error for embedded mode. Do not encode process death inside interchangeable logger implementations.

`conf/parse.go`, `server/opts.go`, `server/server.go`: integer suffix overflow wraps silently

Config integer suffix handling multiplies parsed int64 values without overflow checks:

Go
case "k":
	setValue(it, num*1000)
...
case "eb", "ei", "eib":
	setValue(it, num*1024*1024*1024*1024*1024*1024)

at conf/parse.go:321-347.

I verified this locally:


"x: 9223372036854775807k"  => map[x:-1000], err=<nil>
"x: 9223372036854775807eb" => map[x:-1152921504606846976], err=<nil>

Then server/opts.go only rejects values above math.MaxInt32 for max_control_line and max_payload; it does not reject negative wrapped values:

server/opts.go:1328-1341

Baseline defaults only replace exact zero:

server/opts.go:6058-6065

And validation only checks:

Go
if int64(o.MaxPayload) > o.MaxPending

at server/server.go:1144-1146.

So negative values can survive into runtime. A wrapped max_control_line can make every client fail the control-line check. A wrapped max_payload can reject all publishes unless it happens to equal the sentinel jwt.NoLimit. A wrapped max_pending can produce other charming “why is the server eating itself?” behavior.

This is config-time, not remote unauthenticated network execution. Still a real sharp edge.

Fix: checked multiplication in the config parser. Reject overflow before setValue. Then reject negative values for all inherently nonnegative settings: max_payload, max_control_line, max_pending, log limits, buffer sizes, window sizes, and anything that later becomes an int32 or allocation size.

`server/certstore/certstore_windows.go`: Windows cert-store panic edges

The Windows cert-store backend panics during package initialization if DLL/proc loading fails:

Go
if err := d.Load(); err != nil {
	panic(err)
}
...
if err := p.Find(); err != nil {
	panic(err)
}

at server/certstore/certstore_windows.go:141-155.

The signing/export/property helpers also use &slice[0] without checking zero length:

&digest[0] in ECDSA signing: server/certstore/certstore_windows.go:568, 583
&buf[0] after make([]byte, size): server/certstore/certstore_windows.go:579, 585
RSA signing: server/certstore/certstore_windows.go:626, 641, 643, 663, 678, 680
winGetProperty(): server/certstore/certstore_windows.go:786-790
winExport(): server/certstore/certstore_windows.go:827-835

Normal TLS signing digests are non-empty, and Windows APIs should usually report nonzero output sizes. But the code is still built on “this can never be zero,” which is the phrase etched on the tombstones of many production processes.

Fix: no package init() panics for optional platform integration. Lazy-load with sync.Once and return errors. Guard len(digest) > 0 and size > 0 before taking element addresses. Use nil pointers or unsafe.SliceData patterns deliberately where Windows APIs allow zero-size buffers.

`server/memstore.go`: retention-limit inconsistency panics the whole process

deleteFirstMsgOrPanic() is exactly as subtle as a meteor strike:

Go
panic("jetstream memstore has inconsistent state, can't find first seq msg")

at server/memstore.go:1696-1699.

It is called by retention enforcement:

enforceMsgLimit(): server/memstore.go:1095-1104
enforceBytesLimit(): server/memstore.go:1109-1118

I did not prove an external path to corrupt that invariant. It may be a legitimate “this must never happen” tripwire. But in a server process, the correct blast radius for one stream/account’s corrupt in-memory state is not “terminate the entire daemon.” Especially since this is JetStream, where stream state can be isolated, sealed, marked corrupt, or administratively repaired.

One surrounding area worth extra race testing: removeMsg() deliberately unlocks around callbacks at server/memstore.go:2149-2163. That may be correct, but state mutation + callbacks + retention loops deserve race/fuzz attention.

Fix: return an error, mark the stream unhealthy/offline, emit an advisory, and preserve the rest of the server. Panic in tests; quarantine in production.

`server/parser.go`: minor defensive buglets

protoSnippet() drops the final byte when clamping:

Go
if stop > bufSize {
	stop = bufSize - 1
}
return fmt.Sprintf("%q", buf[start:stop])

at server/parser.go:1245-1254.

That should be stop = bufSize. Current behavior is not a crash; it just makes parser diagnostics worse, and can return an empty snippet when start points at the last byte.

overMaxControlLineLimit() logs len(c.argBuf) even though the offending buffer is arg:

Go
len(c.argBuf)

at server/parser.go:1271-1273.

Again, not critical. Merely an error-reporting gremlin.

`server/client.go` and `server/leafnode.go`: comparatively good armor

The regular NATS client PUB/HPUB path checks payload size before the parser allocates split-message buffers:

server/client.go:2857-2872
server/client.go:2921-2925

Leaf nodes do the same:

server/leafnode.go:3172-3175
server/leafnode.go:3246-3249

This is exactly the pattern routes/gateways should copy. The empire has the technology. It just forgot to deploy it at the border outposts.

`server/websocket.go`: mostly sane resource limits

The WebSocket path is not where I would send the first bug-fixing fleet.

Good signs:

Compressed WebSocket messages are accumulated under wsMaxMessageSize() at server/websocket.go:181-194.
Decompression uses io.LimitedReader with mpay + 1 at server/websocket.go:479-501.
Control frames cap payload length at wsMaxControlPayloadSize around server/websocket.go:307-314.
Uncompressed data is streamed into the normal parser in chunks rather than allocated as one giant declared frame.

I did not find a crash-class WebSocket allocation bug in the reviewed path.

`server/client_proxyproto.go`: bounded enough

PROXY protocol parsing allocates bounded buffers. v2 address length can lead to about 64 KiB allocation (uint16 addrLen) in parseIPv4Addr() / parseIPv6Addr() at server/client_proxyproto.go:381-415, but that is not process-threatening by itself. It also rejects too-short address data before slicing.

Not a critical finding.

`server/ats/ats.go`, `server/avl/seqset.go`, `server/thw/thw.go`

These packages passed their local tests under the available Go toolchain.

server/ats/ats.go has an invariant panic for unbalanced unregister at server/ats/ats.go:69, but I did not find a credible external path to trigger it. It is a sanity fuse, not currently a demonstrated remote bug.

server/avl/seqset.go decode paths check encoded lengths before walking buffers. On 64-bit builds, the uint32 node count multiplied by record size does not overflow int. On 32-bit platforms I would re-check overflow behavior, but that is not a primary NATS server target in 2026.

`server/stree/*`

The subject-tree implementation contains deliberate internal panics such as “node full” and “method called on leaf.” I did not find a direct external path from protocol input to those panics. They look like invariant assertions inside a specialized data structure.

I could not run those package tests because the code uses newer standard-library APIs not available in Go 1.23.

What I would patch first

Fix MQTT packet boundary parsing in server/mqtt.go. Every MQTT packet handler should parse a bounded slice of exactly its remaining length. Reject under-consumption and over-consumption. Add the malformed PUBLISH test that currently appears absent.
Add early MQTT size limits before pbuf accumulation. max_payload must be enforced before buffering a huge PUBLISH; CONNECT/SUBSCRIBE/UNSUBSCRIBE need their own sane caps.
Add route/gateway inbound payload checks matching client.go and leafnode.go. Never allocate c.pa.size-derived capacity from route/gateway input until it is checked.
Remove process-killing panics from logging. Runtime log rotation failure is an operational error, not divine permission to terminate the daemon.
Make Server.Fatalf() deterministic. Logger backends should log; the server should decide exit/panic/error semantics.
Add checked arithmetic to config suffix parsing and reject negative values for unsigned options.

Overall: the codebase is not garbage. The core protocol parser has signs of long combat experience. But the MQTT parser, route/gateway payload path, and logging fatality model are insufficiently paranoid. Those are the places I would send the extermination fleet first.

Chat history

You said:

ChatGPT said:

Highest-priority findings

server/mqtt.go: malformed PUBLISH can panic the process

server/mqtt.go: MQTT partial-packet buffering can memory/GC-DoS before max_payload

server/route.go, server/gateway.go, server/parser.go: route/gateway peers can force huge allocations

logger/log.go: log rotation can panic the server

logger/syslog.go, logger/syslog_windows.go, server/log.go: Fatalf is semantically cursed

conf/parse.go, server/opts.go, server/server.go: integer suffix overflow wraps silently

server/certstore/certstore_windows.go: Windows cert-store panic edges

server/memstore.go: retention-limit inconsistency panics the whole process

server/parser.go: minor defensive buglets

server/client.go and server/leafnode.go: comparatively good armor

server/websocket.go: mostly sane resource limits

server/client_proxyproto.go: bounded enough

server/ats/ats.go, server/avl/seqset.go, server/thw/thw.go

server/stree/*