Golang http.Client Transport Configuration and the TIME_WAIT Phenomenon

First post of 2025. Happy New Year, everyone! (#^.^#)

This post documents a TIME_WAIT problem caused by a misconfigured Transport in Go's net/http Client under a large number of concurrent requests, and reproduces it with demo programs.

Environment Setup

The fault was originally diagnosed on a physical machine running CentOS 7.6. As is well known, the range of local ports available to an IP address is limited, and it can be inspected with the following commands:

$ sysctl -a | grep net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999
or
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999

That works out to 28,232 usable local ports (60999 - 32768 + 1).

To make the scenario easy to reproduce, we use a Docker container and restrict the container's net.ipv4.ip_local_port_range parameter at startup:

$ docker run -itd --sysctl net.ipv4.ip_local_port_range="32768    36768" golang:1.20.14-bullseye

This leaves about 4,000 usable local ports, a limit that the wrk tool can saturate very easily.

Reproducing the Fault

To reproduce the fault we need two demo programs, both of which serve HTTP with the Gin framework.
1. The backends program listens on port 8088 and provides a simple ping/pong endpoint:

package main

import "github.com/gin-gonic/gin"

func main() {
	gin.SetMode(gin.ReleaseMode)
	r := gin.Default()
	r.GET("/ping", func(c *gin.Context) {
		c.JSON(200, gin.H{
			"message": "pong",
		})
	})
	r.Run(":8088")
}

Compile it:

$ go build -o backends

2. The transponder program listens on port 8080 and acts as a forwarding proxy: each request to its tping route triggers an HTTP request to the ping route on port 8088.

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"

	"github.com/gin-gonic/gin"
)

type ping struct {
	Message string `json:"message"`
}

func main() {
	gin.SetMode(gin.ReleaseMode)
	gin.DefaultWriter = io.Discard
	r := gin.Default()

	url := "http://127.0.0.1:8088/ping"

	// Use the default http.Client as-is.
	client := &http.Client{}

	r.GET("/tping", func(c *gin.Context) {
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("http.NewRequest failed: %v", err),
			})
			return
		}

		resp, err := client.Do(req)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("get failed: %v", err),
			})
			return
		}
		defer resp.Body.Close()

		if resp.StatusCode != http.StatusOK {
			c.JSON(resp.StatusCode, gin.H{
				"error": fmt.Sprintf("request failed with status code: %d", resp.StatusCode),
			})
			return
		}

		body, err := io.ReadAll(resp.Body)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("read body failed: %v", err),
			})
			return
		}

		var tping ping
		err = json.Unmarshal(body, &tping)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("unmarshal failed: %v", err),
			})
			return
		}

		c.JSON(200, gin.H{
			"data": tping.Message,
		})
	})

	r.Run(":8080")
}

Compile it:

$ go build -o transponder_v1

3. Start backends and transponder_v1 inside the container prepared earlier, then verify with curl that both services respond normally:

root@d76a1f6b4069:/opt# curl http://127.0.0.1:8088/ping
{"message":"pong"}

root@d76a1f6b4069:/opt# curl http://127.0.0.1:8080/tping
{"data":"pong"}

The network environment is also quite simple: only the two services we just started are running.

root@d76a1f6b4069:/go# netstat -anlp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 0 :::8080 :::* LISTEN 953/./transponder_v
tcp6 0 0 :::8088 :::* LISTEN 579/./backends
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path

root@d76a1f6b4069:/go# ss -s
Total: 652
TCP: 29 (estab 0, closed 27, orphaned 0, timewait 1)

Transport Total IP IPv6
RAW 0 0 0
UDP 0 0 0
TCP 2 0 2
INET 2 0 2
FRAG 0 0 0

4. Prepare the wrk tool and start the stress test, here with 8 threads and 600 connections for 180 seconds:

root@d76a1f6b4069:/opt# ./wrk -t 8 -c 600 -d 180s http://127.0.0.1:8080/tping
Running 3m test @ http://127.0.0.1:8080/tping
8 threads and 600 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 574.59ms 611.03ms 2.00s 75.02%
Req/Sec 67.71 58.33 721.00 86.88%
96018 requests in 3.00m, 19.79MB read
Socket errors: connect 0, read 0, write 0, timeout 21152
Non-2xx or 3xx responses: 57687
Requests/sec: 533.16
Transfer/sec: 112.52KB

The wrk output above is the final result; while the test is running, observe the effects described in step 5 below.

5. After waiting ten-odd seconds, the TIME_WAIT phenomenon can be observed with the following commands:
1) Repeated curl requests to the tping route on port 8080 all return an error

root@d76a1f6b4069:/opt# curl http://127.0.0.1:8080/tping
{"error":"get failed: Get \"http://127.0.0.1:8088/ping\": dial tcp 127.0.0.1:8088: connect: cannot assign requested address"}

Although transponder_v1 has not crashed, under these curl requests it can no longer return the pong result and reports the error above instead.

2) The ss command

root@d76a1f6b4069:/go# ss -s
Total: 2637
TCP: 5730 (estab 1780, closed 3948, orphaned 0, timewait 3712)

Transport Total IP IPv6
RAW 0 0 0
UDP 0 0 0
TCP 1782 890 892
INET 1782 890 892
FRAG 0 0 0

The ss output shows that the connections produced by the wrk load far exceed the 4,000-port local range limit we configured.

3) Filtering with netstat
The ss output above shows that many connections have been established; netstat lets us inspect them in detail.

root@d76a1f6b4069:/go# netstat -anlp | grep ESTABLISHED 
tcp6 45 0 127.0.0.1:8080 127.0.0.1:34111 ESTABLISHED 598/./transponder_v
tcp6 0 0 127.0.0.1:8080 127.0.0.1:34461 ESTABLISHED 598/./transponder_v
tcp6 0 0 127.0.0.1:8088 127.0.0.1:34845 ESTABLISHED 579/./backends
tcp6 45 0 127.0.0.1:8080 127.0.0.1:34635 ESTABLISHED 598/./transponder_v
tcp6 0 0 127.0.0.1:8080 127.0.0.1:34877 ESTABLISHED 598/./transponder_v
tcp6 45 0 127.0.0.1:8080 127.0.0.1:33851 ESTABLISHED 598/./transponder_v
tcp6 45 0 127.0.0.1:8080 127.0.0.1:34745 ESTABLISHED 598/./transponder_v
tcp6 45 0 127.0.0.1:8080 127.0.0.1:34001 ESTABLISHED 598/./transponder_v
tcp6 0 0 127.0.0.1:8080 127.0.0.1:33727 ESTABLISHED 598/./transponder_v
tcp6 45 0 127.0.0.1:8080 127.0.0.1:34981 ESTABLISHED 598/./transponder_v
tcp6 0 0 127.0.0.1:8088 127.0.0.1:36216 ESTABLISHED 579/./backends
tcp6 0 0 127.0.0.1:8088 127.0.0.1:36647 ESTABLISHED 579/./backends
.......

root@d76a1f6b4069:/go# netstat -anlp | grep TIME_WAIT
tcp 0 0 127.0.0.1:35690 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33959 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:34872 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33673 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:35497 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:35692 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:32972 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:34505 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33402 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:36374 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33676 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33601 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33902 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:33624 127.0.0.1:8088 TIME_WAIT -
tcp 0 0 127.0.0.1:36397 127.0.0.1:8088 TIME_WAIT -
........

root@d76a1f6b4069:/go# netstat -anlp | grep TIME_WAIT | wc -l
3768

root@d76a1f6b4069:/go# netstat -anlp | grep ESTABLISHED | wc -l
1666

Analyzing the filtered netstat output, port 8088 has accumulated a large number of TIME_WAIT connections, and there are also many ESTABLISHED connections between backends and transponder_v1, consistent with the ss figures.

Analyzing the Phenomenon

The steps above reproduce the TIME_WAIT phenomenon. Every connection in the TIME_WAIT state occupies a local port, so when too many of them accumulate, local port resources can be exhausted and new connections can no longer be established. On Linux a socket stays in TIME_WAIT for a fixed 60 seconds, so at a few hundred short-lived connections per second our 4,000-port budget is used up within seconds. Too many TIME_WAIT connections also increase the kernel's bookkeeping burden and hurt overall system performance.
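
For completeness, the figure that ss -s reports can also be computed from inside a Go program. The following is a minimal sketch (not part of the original demos) that tallies TIME_WAIT sockets by scanning /proc/net/tcp and /proc/net/tcp6, where the fourth whitespace-separated column holds the TCP state in hex and 06 means TIME_WAIT:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// countTimeWait counts sockets in TIME_WAIT listed in one /proc/net table.
// In /proc/net/tcp{,6} the fourth field ("st") is the TCP state in hex;
// 06 corresponds to TIME_WAIT.
func countTimeWait(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	n := 0
	s := bufio.NewScanner(f)
	s.Scan() // skip the header line
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) > 3 && fields[3] == "06" {
			n++
		}
	}
	return n, s.Err()
}

func main() {
	total := 0
	for _, p := range []string{"/proc/net/tcp", "/proc/net/tcp6"} {
		if n, err := countTimeWait(p); err == nil {
			total += n
		}
	}
	fmt.Println("TIME_WAIT sockets:", total)
}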

Our demo is very simple, and given the error connect: cannot assign requested address returned to curl, a bit of research and source reading points straight at the transponder_v1 program.

Look at this line of code:

// Use the default http.Client as-is.
client := &http.Client{}

Jumping to the definition from the IDE shows the Client struct and its transport method:

type Client struct {
	// Transport specifies the mechanism by which individual
	// HTTP requests are made.
	// If nil, DefaultTransport is used.
	Transport RoundTripper
	......
}

func (c *Client) transport() RoundTripper {
	if c.Transport != nil {
		return c.Transport
	}
	return DefaultTransport
}

The comment explains that Transport specifies the mechanism by which individual HTTP requests are made, and that DefaultTransport is used when it is nil. Continuing downward:

// DefaultTransport is the default implementation of Transport and is
// used by DefaultClient. It establishes network connections as needed
// and caches them for reuse by subsequent calls. It uses HTTP proxies
// as directed by the environment variables HTTP_PROXY, HTTPS_PROXY
// and NO_PROXY (or the lowercase versions thereof).
var DefaultTransport RoundTripper = &Transport{
	Proxy: ProxyFromEnvironment,
	DialContext: defaultTransportDialContext(&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}),
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
}

// DefaultMaxIdleConnsPerHost is the default value of Transport's
// MaxIdleConnsPerHost.
const DefaultMaxIdleConnsPerHost = 2

So DefaultTransport is the default implementation of Transport, used by DefaultClient: it establishes network connections as needed and caches them for reuse by subsequent calls. But it contains two critical settings, MaxIdleConnsPerHost and MaxIdleConns:

type Transport struct {
	......
	// MaxIdleConns controls the maximum number of idle (keep-alive)
	// connections across all hosts. Zero means no limit.
	MaxIdleConns int

	// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
	// (keep-alive) connections to keep per-host. If zero,
	// DefaultMaxIdleConnsPerHost is used.
	MaxIdleConnsPerHost int

	// MaxConnsPerHost optionally limits the total number of
	// connections per host, including connections in the dialing,
	// active, and idle states. On limit violation, dials will block.
	//
	// Zero means no limit.
	MaxConnsPerHost int

	// IdleConnTimeout is the maximum amount of time an idle
	// (keep-alive) connection will remain idle before closing
	// itself.
	// Zero means no limit.
	IdleConnTimeout time.Duration
	......
}

MaxIdleConns sets the connection pool to 100 idle connections in total by default, but MaxIdleConnsPerHost defaults to 2, so only 2 idle connections per host are retained; every other connection is actively closed by the client and therefore enters TIME_WAIT. Stress requests from wrk thus generate TIME_WAIT connections en masse, eventually exhausting all of the host's available ports so that no new connection can be opened, which surfaces as the error connect: cannot assign requested address.
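
To see this per-host limit in action, here is a small observational sketch (assuming the backends demo is still serving http://127.0.0.1:8088/ping) that uses net/http/httptrace to print whether each request got a reused pooled connection. With the default Transport, the first round of 10 concurrent requests all dial fresh connections, and in the next round typically only about 2 print reused=true, matching DefaultMaxIdleConnsPerHost = 2:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"sync"
)

func main() {
	client := &http.Client{} // DefaultTransport: MaxIdleConnsPerHost = 2

	doReq := func(id int) {
		// Report whether this request got a reused (pooled) connection.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				fmt.Printf("req %d: reused=%v\n", id, info.Reused)
			},
		}
		req, err := http.NewRequest("GET", "http://127.0.0.1:8088/ping", nil)
		if err != nil {
			fmt.Println(err)
			return
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println(err)
			return
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can return to the pool
		resp.Body.Close()
	}

	for round := 0; round < 2; round++ {
		var wg sync.WaitGroup
		for i := 0; i < 10; i++ {
			wg.Add(1)
			go func(id int) {
				defer wg.Done()
				doReq(id)
			}(round*10 + i)
		}
		wg.Wait() // after the first round, only 2 idle connections survive in the pool
	}
}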

So MaxIdleConnsPerHost and MaxIdleConns need to be tuned step by step to the actual situation, guided by performance testing, reference values from experience, and production monitoring.
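
Besides the field-by-field copy shown in the next section, note that Transport.Clone(), available since Go 1.13, is a safer way to duplicate DefaultTransport, because it avoids copying the Transport's internal locks by value. A hedged sketch, with illustrative numbers rather than recommendations (assumes net/http is imported):

// newTunedClient returns an http.Client whose Transport is a clone of
// http.DefaultTransport with larger idle-connection limits.
// The values below are illustrative; tune them against your real workload.
func newTunedClient() *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone() // Go 1.13+
	t.MaxIdleConns = 1000        // max idle connections across all hosts
	t.MaxIdleConnsPerHost = 1000 // max idle connections kept per host
	t.MaxConnsPerHost = 0        // 0 means no cap on total connections per host
	return &http.Client{Transport: t}
}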

Program Optimization

Knowing the cause, and after some research, the Transport can be configured as follows:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"

	"github.com/gin-gonic/gin"
)

type ping struct {
	Message string `json:"message"`
}

func main() {
	gin.SetMode(gin.ReleaseMode)
	gin.DefaultWriter = io.Discard
	r := gin.Default()

	url := "http://127.0.0.1:8088/ping"

	// Copy DefaultTransport and raise its idle-connection limits so that
	// connections to the backend are pooled instead of being closed.
	defaultRoundTripper := http.DefaultTransport
	defaultTransportPointer, ok := defaultRoundTripper.(*http.Transport)
	if !ok {
		fmt.Println("defaultRoundTripper not an *http.Transport")
		return
	}
	defaultTransport := *defaultTransportPointer
	defaultTransport.MaxIdleConns = 1000
	defaultTransport.MaxIdleConnsPerHost = 1000
	client := &http.Client{Transport: &defaultTransport}

	// Previously: use the default http.Client as-is.
	// client := &http.Client{}

	r.GET("/tping", func(c *gin.Context) {
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("http.NewRequest failed: %v", err),
			})
			return
		}

		resp, err := client.Do(req)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("get failed: %v", err),
			})
			return
		}
		defer resp.Body.Close()

		if resp.StatusCode != http.StatusOK {
			c.JSON(resp.StatusCode, gin.H{
				"error": fmt.Sprintf("request failed with status code: %d", resp.StatusCode),
			})
			return
		}

		body, err := io.ReadAll(resp.Body)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("read body failed: %v", err),
			})
			return
		}

		var tping ping
		err = json.Unmarshal(body, &tping)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{
				"error": fmt.Sprintf("unmarshal failed: %v", err),
			})
			return
		}

		c.JSON(200, gin.H{
			"data": tping.Message,
		})
	})

	r.Run(":8080")
}

Compile it:

$ go build -o transponder_v2

Stop transponder_v1, run transponder_v2, and rerun the wrk stress test, observing curl and ss during the run: the TIME_WAIT phenomenon has disappeared.
1) wrk results:

root@d76a1f6b4069:/opt# ./wrk -t 8 -c 600 -d 180s http://127.0.0.1:8080/tping
Running 3m test @ http://127.0.0.1:8080/tping
8 threads and 600 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 195.39ms 239.25ms 1.84s 90.57%
Req/Sec 606.10 214.25 1.30k 72.97%
531584 requests in 3.00m, 69.96MB read
Socket errors: connect 0, read 0, write 0, timeout 600
Requests/sec: 2951.69
Transfer/sec: 397.79KB

Compared with the earlier wrk run, the optimized result no longer shows any Non-2xx or 3xx responses, and the timeout count has dropped markedly.

2) Repeated curl requests all receive a normal response.

root@d76a1f6b4069:/opt# curl http://127.0.0.1:8080/tping
{"data":"pong"}

3) The mass of TIME_WAIT sockets in ss is likewise gone.

root@d76a1f6b4069:/go# ss -s
Total: 3044
TCP: 2437 (estab 2400, closed 35, orphaned 0, timewait 9)

Transport Total IP IPv6
RAW 0 0 0
UDP 0 0 0
TCP 2402 1200 1202
INET 2402 1200 1202
FRAG 0 0 0

This concludes the fault reproduction, and we have found the solution.

To summarize the experience: netstat and ss are very useful tools when troubleshooting; observing the network with them quickly pins down the specific problem. And during development, always think through the call relationships, and do your own functional and load testing.

References

1. https://studygolang.com/articles/28263
2. https://pkg.go.dev/net/http#Transport