Fixing the Prometheus error "context deadline exceeded" and enabling hot reload


Today the Greenplum target in Prometheus started reporting this error:

Get "http://x.x.x.x:9297/metrics": context deadline exceeded

The status of the greenplum_exporter service shows errors as well:

[root@localhost ~]# systemctl status greenplum_exporter
● greenplum_exporter.service - greenplum exporter
   Loaded: loaded (/etc/systemd/system/greenplum_exporter.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2025-04-15 10:16:40 CST; 5h 0min ago
 Main PID: 62580 (greenplum_expor)
   CGroup: /system.slice/greenplum_exporter.service
           └─62580 /usr/local/greenplum_exporter/bin/greenplum_exporter --log.level=error

Apr 15 10:16:40 localhost systemd[1]: Started greenplum exporter.
Apr 15 12:59:20 localhost greenplum_exporter[62580]: time="2025-04-15T12:59:20+08:00" level=error msg="get metrics for scraper:segment_scraper failed, error:pq: canceling statement du...ctor.go:100"
Apr 15 13:14:19 localhost greenplum_exporter[62580]: time="2025-04-15T13:14:19+08:00" level=error msg="get metrics for scraper:segment_scraper failed, error:pq: canceling statement du...ctor.go:100"
Hint: Some lines were ellipsized, use -l to show in full.

Run curl on the server itself to see how long a scrape takes:

[root@localhost ~]# time curl http://127.0.0.1:9297/metrics
# HELP greenplum_cluster_active_connections Active connections of GreenPlum cluster at scape time
# TYPE greenplum_cluster_active_connections gauge
greenplum_cluster_active_connections 90
# HELP greenplum_cluster_active_connections_per_client Active connections of specified database user
# TYPE greenplum_cluster_active_connections_per_client gauge
greenplum_cluster_active_connections_per_client{client=""} 10
... (output omitted) ...
greenplum_server_users_total_count 9
# HELP greenplum_up Whether greenPlum cluster is reachable
# TYPE greenplum_up gauge
greenplum_up 1

real	0m12.524s
user	0m0.000s
sys	    0m0.005s

The metrics come back correctly, but the request takes about 12.5 seconds.
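
A single `time curl` run only gives one data point, and the exporter errors above are intermittent. The sketch below is one way to time the endpoint without printing the metrics body, using curl's `--write-out` variable `time_total`. The function name `scrape_seconds` is made up for this illustration.

```shell
# Print the total time in seconds that curl needs to fetch a URL.
# The response body is discarded; only the timing is printed.
scrape_seconds() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}
```

For example, `for i in 1 2 3; do scrape_seconds http://127.0.0.1:9297/metrics; sleep 5; done` samples the exporter three times, which makes it easier to see whether the response time is consistently above the scrape timeout.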

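  • By default Prometheus pulls data once per minute (scrape_interval defaults to 1m), but deployments commonly set it to 15s;
  • The default scrape timeout is 10s (scrape_timeout); a scrape that does not finish within that time fails with an error;
  • scrape_timeout must not be greater than scrape_interval;
  • scrape_interval can be set globally or overridden per job (per scrape_configs entry), not per individual metric;
  • Prometheus evaluates alerting rules periodically, once every evaluation_interval, and updates the alert states accordingly;
  • In prometheus.yml, evaluation_interval can only be set in the global section;
  • The configuration currently in effect can be viewed at http://127.0.0.1:7181/config.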
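
The rules in the list above can be checked against a measured scrape time before touching the configuration. This is a minimal sketch in POSIX sh. The function names are hypothetical, and the duration parser only understands single-unit values in s, m or h (not ms or compound values such as 1m30s).

```shell
# Convert a simple duration such as 10s, 1m or 2h to seconds.
to_seconds() {
  case "$1" in
    *ms) echo "unsupported duration: $1" >&2; return 1 ;;
    *s)  echo "${1%s}" ;;
    *m)  echo $(( ${1%m} * 60 )) ;;
    *h)  echo $(( ${1%h} * 3600 )) ;;
    *)   echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Arguments: scrape_interval, scrape_timeout, measured scrape time in whole seconds.
check_scrape_settings() {
  interval=$(to_seconds "$1") || return 1
  timeout=$(to_seconds "$2") || return 1
  measured=$3
  if [ "$timeout" -gt "$interval" ]; then
    echo "invalid: scrape_timeout exceeds scrape_interval"
  elif [ "$measured" -ge "$timeout" ]; then
    echo "timeout: scrape takes ${measured}s, limit is ${timeout}s"
  else
    echo "ok"
  fi
}

check_scrape_settings 15s 10s 12   # 15s interval, 10s timeout, 12s scrape
check_scrape_settings 60s 30s 12   # a longer interval and timeout
```

With the roughly 12-second response measured earlier, the first call prints `timeout: scrape takes 12s, limit is 10s` and the second prints `ok`.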

The config page shows the scrape settings in effect for the greenplum job:

- job_name: greenplum
  honor_timestamps: true
  scrape_interval: 15s # scrape interval currently in effect
  scrape_timeout: 10s  # scrape timeout currently in effect (the default)
  metrics_path: /metrics
  scheme: http
  follow_redirects: true

For details on each parameter, see the official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

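The greenplum job does not override anything, so it scrapes every 15 seconds with a 10-second timeout. Since the exporter needs about 12 seconds to respond, the scrape is cancelled at the 10-second mark, and Prometheus reports "context deadline exceeded".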

Edit the Prometheus configuration file prometheus.yml and override the settings for this job only:

  - job_name: 'greenplum'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 60s   # longer scrape interval for this job
    scrape_timeout: 30s    # longer scrape timeout for this job
    static_configs:
    - targets: ['127.0.0.1:9297']

Edit the service file /etc/systemd/system/prometheus.service:

vim  /etc/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0:7181 --web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
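  • --web.listen-address=0.0.0.0:7181 makes Prometheus listen on port 7181
  • --web.enable-lifecycle enables the lifecycle HTTP endpoints such as /-/reload, so configuration changes can be applied without restarting the process; the flag can appear anywhere on the command line

Reload the systemd unit files and restart Prometheus so that the new flag takes effect:

systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus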

After the change, check the service status:

[root@localhost ~]# systemctl status prometheus
● prometheus.service - prometheus
   Loaded: loaded (/etc/systemd/system/prometheus.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2025-04-15 15:50:45 CST; 32s ago
 Main PID: 29012 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─29012 /data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0...

Apr 15 15:50:46 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:46.426Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12667 maxSegment=12670
Apr 15 15:50:46 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:46.877Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12668 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.071Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12669 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.074Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12670 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.074Z caller=head.go:773 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=30.828886m...=1.142973381s
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:815 fs_type=XFS_SUPER_MAGIC
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:818 msg="TSDB started"
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:944 msg="Loading configuration file" filename=/data2/doris/prometheus//prometheus.yml
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.092Z caller=main.go:975 msg="Completed loading of configuration file" filename=/data2/doris/prometheus//prome…µs
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.092Z caller=main.go:767 msg="Server is ready to receive web requests."
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost prometheus]# ps -ef |grep prometheus
prometh+  29012      1  8 15:50 ?        00:00:11 /data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0:7181 --web.enable-lifecycle
root      47154 321416  0 15:52 pts/1    00:00:00 grep --color=auto prometheus

From now on, configuration changes can be applied with a hot reload; no service restart is needed:

curl -XPOST http://127.0.0.1:7181/-/reload
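
The plain curl call prints nothing on success, so it is easy to miss a failed reload. The sketch below wraps the same request and reports the HTTP status. The function names are made up for this illustration, and the meaning attached to 403 (lifecycle API not enabled) is based on Prometheus behaviour as I understand it, so treat the messages as hints rather than a definitive diagnosis.

```shell
# Map the HTTP status of POST /-/reload to a short explanation.
describe_reload_status() {
  case "$1" in
    200) echo "reload accepted" ;;
    403) echo "lifecycle API disabled; start Prometheus with --web.enable-lifecycle" ;;
    000) echo "no response; is Prometheus listening on this address?" ;;
    *)   echo "reload failed with HTTP $1; check the Prometheus log" ;;
  esac
}

# POST to the reload endpoint of the Prometheus instance at base URL $1.
reload_prometheus() {
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$1/-/reload")
  describe_reload_status "$code"
}
```

Usage: `reload_prometheus http://127.0.0.1:7181`. Running `promtool check config` on prometheus.yml before the reload catches syntax errors early; promtool ships in the Prometheus release archive.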

Checking Prometheus again, the target status is back to normal.