Fixing the Prometheus "context deadline exceeded" error and enabling hot reload
Today Prometheus started reporting the following error for the Greenplum target:
Get "http://x.x.x.x:9297/metrics": context deadline exceeded
Checking greenplum_exporter shows errors in its output:
[root@localhost ~]# systemctl status greenplum_exporter
● greenplum_exporter.service - greenplum exporter
Loaded: loaded (/etc/systemd/system/greenplum_exporter.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2025-04-15 10:16:40 CST; 5h 0min ago
Main PID: 62580 (greenplum_expor)
CGroup: /system.slice/greenplum_exporter.service
└─62580 /usr/local/greenplum_exporter/bin/greenplum_exporter --log.level=error
Apr 15 10:16:40 localhost systemd[1]: Started greenplum exporter.
Apr 15 12:59:20 localhost greenplum_exporter[62580]: time="2025-04-15T12:59:20+08:00" level=error msg="get metrics for scraper:segment_scraper failed, error:pq: canceling statement du...ctor.go:100"
Apr 15 13:14:19 localhost greenplum_exporter[62580]: time="2025-04-15T13:14:19+08:00" level=error msg="get metrics for scraper:segment_scraper failed, error:pq: canceling statement du...ctor.go:100"
Hint: Some lines were ellipsized, use -l to show in full.
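The journal lines above are truncated; to read the full error messages, the usual systemd tools can be used (a routine check, nothing specific to this exporter):
systemctl status -l greenplum_exporter
journalctl -u greenplum_exporter --no-pager -n 50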
On the server, time a curl request against the exporter to see how long a scrape takes:
[root@localhost ~]# time curl http://127.0.0.1:9297/metrics
# HELP greenplum_cluster_active_connections Active connections of GreenPlum cluster at scape time
# TYPE greenplum_cluster_active_connections gauge
greenplum_cluster_active_connections 90
# HELP greenplum_cluster_active_connections_per_client Active connections of specified database user
# TYPE greenplum_cluster_active_connections_per_client gauge
greenplum_cluster_active_connections_per_client{client=""} 10
...output omitted...
greenplum_server_users_total_count 9
# HELP greenplum_up Whether greenPlum cluster is reachable
# TYPE greenplum_up gauge
greenplum_up 1
real 0m12.524s
user 0m0.000s
sys 0m0.005s
The metrics are returned correctly, but the request takes about 12 seconds.
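To reproduce what Prometheus experiences with its default 10-second scrape timeout, curl can be given the same deadline (a quick sketch using curl's --max-time option):
curl --max-time 10 http://127.0.0.1:9297/metrics
With a roughly 12-second response time this request should abort with curl exit code 28 (operation timed out), which corresponds to the context deadline exceeded error seen on the Prometheus side.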
- By default Prometheus pulls data once per minute (parameter scrape_interval), although most setups configure 15s;
- The default scrape timeout is 10s (parameter scrape_timeout); if a target cannot return its data within 10s, the scrape fails with an error;
- scrape_timeout must not be larger than scrape_interval;
- scrape_interval can be set globally or overridden per scrape job (see the sketch after this list);
- Prometheus evaluates alerting and recording rules every evaluation_interval and then updates the alert states;
- evaluation_interval can only be set globally;
- The running configuration can be viewed at http://127.0.0.1:7181/config.
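For reference, here is a minimal sketch of the global section in prometheus.yml where these defaults live (the values are illustrative, not copied from the running system; per-job overrides are shown further below):
global:
  scrape_interval: 15s      # default for every job unless overridden per job
  scrape_timeout: 10s       # must not exceed scrape_interval
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated; global only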
The config page shows the scrape settings currently in effect for the greenplum job:
- job_name: greenplum
  honor_timestamps: true
  scrape_interval: 15s   # default scrape interval
  scrape_timeout: 10s    # default scrape timeout
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
See the official documentation for the full list of parameters: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
The greenplum job uses the defaults: a scrape every 15 seconds with a 10-second timeout. Since each scrape takes about 12 seconds, the timeout is exceeded and the scrape fails.
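The same conclusion can be reached through the targets API, which reports the last scrape error per target (a quick check; jq is assumed to be installed, otherwise read the raw JSON):
curl -s http://127.0.0.1:7181/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
The failing greenplum target should show health "down" and lastError "context deadline exceeded".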
Edit the Prometheus configuration file prometheus.yml and adjust the settings for this job:
  - job_name: 'greenplum'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 60s   # longer scrape interval
    scrape_timeout: 30s    # longer scrape timeout
    static_configs:
      - targets: ['127.0.0.1:9297']
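Before applying the change, the file can be validated with promtool, which ships in the Prometheus release tarball (the path below assumes promtool sits next to the prometheus binary referenced in the service file):
/data2/doris/prometheus/promtool check config /data2/doris/prometheus/prometheus.yml
promtool should flag syntax errors and invalid combinations such as a scrape_timeout larger than the scrape_interval before the file is loaded.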
Edit the systemd service file /etc/systemd/system/prometheus.service:
vim /etc/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0:7181 --web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
--web.listen-address=0.0.0.0:7181
Sets the listen port to 7181.
--web.enable-lifecycle
Enables hot reloading of the configuration over HTTP so the process does not need to be restarted; here it is appended as the last argument on the command line.
Reload systemd and start Prometheus again so the new flags take effect:
systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus
After the change, check the service status:
[root@localhost ~]# systemctl status prometheus
● prometheus.service - prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2025-04-15 15:50:45 CST; 32s ago
Main PID: 29012 (prometheus)
CGroup: /system.slice/prometheus.service
└─29012 /data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0...
Apr 15 15:50:46 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:46.426Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12667 maxSegment=12670
Apr 15 15:50:46 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:46.877Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12668 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.071Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12669 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.074Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12670 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.074Z caller=head.go:773 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=30.828886m...=1.142973381s
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:815 fs_type=XFS_SUPER_MAGIC
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:818 msg="TSDB started"
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:944 msg="Loading configuration file" filename=/data2/doris/prometheus//prometheus.yml
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.092Z caller=main.go:975 msg="Completed loading of configuration file" filename=/data2/doris/prometheus//prome…µs
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.092Z caller=main.go:767 msg="Server is ready to receive web requests."
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost prometheus]# ps -ef |grep prometheus
prometh+ 29012 1 8 15:50 ? 00:00:11 /data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0:7181 --web.enable-lifecycle
root 47154 321416 0 15:52 pts/1 00:00:00 grep --color=auto prometheus
From now on, whenever the configuration changes, it can be hot-reloaded without restarting the service:
curl -XPOST http://127.0.0.1:7181/-/reload
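Whether the reload succeeded can be verified through Prometheus's own self-monitoring metrics (a quick check; the metric name is standard, the port follows the setup above):
curl -s http://127.0.0.1:7181/metrics | grep prometheus_config_last_reload_successful
A value of 1 means the new configuration was applied; on a failed reload Prometheus keeps the previous configuration and the metric drops to 0.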
Checking the Prometheus target status afterwards shows that scraping has returned to normal.