Fixing the Prometheus "context deadline exceeded" error and enabling hot reload
Today Prometheus started reporting the following error for the Greenplum target:
Get "http://x.x.x.x:9297/metrics": context deadline exceeded
Checking greenplum_exporter shows errors in its output:
[root@localhost ~]# systemctl status greenplum_exporter
● greenplum_exporter.service - greenplum exporter
Loaded: loaded (/etc/systemd/system/greenplum_exporter.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2025-04-15 10:16:40 CST; 5h 0min ago
Main PID: 62580 (greenplum_expor)
CGroup: /system.slice/greenplum_exporter.service
└─62580 /usr/local/greenplum_exporter/bin/greenplum_exporter --log.level=error
Apr 15 10:16:40 localhost systemd[1]: Started greenplum exporter.
Apr 15 12:59:20 localhost greenplum_exporter[62580]: time="2025-04-15T12:59:20+08:00" level=error msg="get metrics for scraper:segment_scraper failed, error:pq: canceling statement du...ctor.go:100"
Apr 15 13:14:19 localhost greenplum_exporter[62580]: time="2025-04-15T13:14:19+08:00" level=error msg="get metrics for scraper:segment_scraper failed, error:pq: canceling statement du...ctor.go:100"
Hint: Some lines were ellipsized, use -l to show in full.
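The journal lines above are truncated; to read the full error messages, the usual systemd tools can be used (a routine check, nothing specific to this exporter):
systemctl status -l greenplum_exporter
journalctl -u greenplum_exporter --no-pager -n 50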
On the server, time a curl request against the exporter to see how long a scrape takes:
[root@localhost ~]# time curl http://127.0.0.1:9297/metrics
# HELP greenplum_cluster_active_connections Active connections of GreenPlum cluster at scape time
# TYPE greenplum_cluster_active_connections gauge
greenplum_cluster_active_connections 90
# HELP greenplum_cluster_active_connections_per_client Active connections of specified database user
# TYPE greenplum_cluster_active_connections_per_client gauge
greenplum_cluster_active_connections_per_client{client=""} 10
...output omitted...
greenplum_server_users_total_count 9
# HELP greenplum_up Whether greenPlum cluster is reachable
# TYPE greenplum_up gauge
greenplum_up 1
real 0m12.524s
user 0m0.000s
sys 0m0.005s
The metrics are returned correctly, but the request takes about 12 seconds.
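To reproduce what Prometheus experiences with its default 10-second scrape timeout, curl can be given the same deadline (a quick sketch using curl's --max-time option):
curl --max-time 10 http://127.0.0.1:9297/metrics
With a roughly 12-second response time this request should abort with curl exit code 28 (operation timed out), which corresponds to the context deadline exceeded error seen on the Prometheus side.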
- By default Prometheus pulls data once per minute (parameter scrape_interval), although most setups configure 15s;
- The default scrape timeout is 10s (parameter scrape_timeout); if a target cannot return its data within 10s, the scrape fails with an error;
- scrape_timeout must not be larger than scrape_interval;
- scrape_interval can be set globally or overridden per scrape job (see the sketch after this list);
- Prometheus evaluates alerting and recording rules every evaluation_interval and then updates the alert states;
- evaluation_interval can only be set globally;
- The running configuration can be viewed at http://127.0.0.1:7181/config.
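For reference, here is a minimal sketch of the global section in prometheus.yml where these defaults live (the values are illustrative, not copied from the running system; per-job overrides are shown further below):
global:
  scrape_interval: 15s      # default for every job unless overridden per job
  scrape_timeout: 10s       # must not exceed scrape_interval
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated; global only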
The config page shows the scrape settings currently in effect for the greenplum job:
- job_name: greenplum
  honor_timestamps: true
  scrape_interval: 15s   # default scrape interval
  scrape_timeout: 10s    # default scrape timeout
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
See the official documentation for the full list of parameters: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
The greenplum job uses the defaults: a scrape every 15 seconds with a 10-second timeout. Since each scrape takes about 12 seconds, the timeout is exceeded and the scrape fails.
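The same conclusion can be reached through the targets API, which reports the last scrape error per target (a quick check; jq is assumed to be installed, otherwise read the raw JSON):
curl -s http://127.0.0.1:7181/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
The failing greenplum target should show health "down" and lastError "context deadline exceeded".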
Edit the Prometheus configuration file prometheus.yml and adjust the settings for this job:
  - job_name: 'greenplum'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 60s   # longer scrape interval
    scrape_timeout: 30s    # longer scrape timeout
    static_configs:
      - targets: ['127.0.0.1:9297']
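Before applying the change, the file can be validated with promtool, which ships in the Prometheus release tarball (the path below assumes promtool sits next to the prometheus binary referenced in the service file):
/data2/doris/prometheus/promtool check config /data2/doris/prometheus/prometheus.yml
promtool should flag syntax errors and invalid combinations such as a scrape_timeout larger than the scrape_interval before the file is loaded.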
Edit the systemd service file /etc/systemd/system/prometheus.service:
vim /etc/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0:7181 --web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
--web.listen-address=0.0.0.0:7181
Sets the listen port to 7181.
--web.enable-lifecycle
Enables hot reloading of the configuration over HTTP so the process does not need to be restarted; here it is appended as the last argument on the command line.
Reload systemd and start Prometheus again so the new flags take effect:
systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus
After the change, check the service status:
[root@localhost ~]# systemctl status prometheus
● prometheus.service - prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2025-04-15 15:50:45 CST; 32s ago
Main PID: 29012 (prometheus)
CGroup: /system.slice/prometheus.service
└─29012 /data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0...
Apr 15 15:50:46 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:46.426Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12667 maxSegment=12670
Apr 15 15:50:46 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:46.877Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12668 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.071Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12669 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.074Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=12670 maxSegment=12670
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.074Z caller=head.go:773 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=30.828886m...=1.142973381s
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:815 fs_type=XFS_SUPER_MAGIC
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:818 msg="TSDB started"
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.091Z caller=main.go:944 msg="Loading configuration file" filename=/data2/doris/prometheus//prometheus.yml
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.092Z caller=main.go:975 msg="Completed loading of configuration file" filename=/data2/doris/prometheus//prome…µs
Apr 15 15:50:47 localhost prometheus[29012]: level=info ts=2025-04-15T07:50:47.092Z caller=main.go:767 msg="Server is ready to receive web requests."
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost prometheus]# ps -ef |grep prometheus
prometh+ 29012 1 8 15:50 ? 00:00:11 /data2/doris/prometheus/prometheus --config.file=/data2/doris/prometheus//prometheus.yml --storage.tsdb.path=/data2/doris/prometheus//data --web.listen-address=0.0.0.0:7181 --web.enable-lifecycle
root 47154 321416 0 15:52 pts/1 00:00:00 grep --color=auto prometheus
From now on, whenever the configuration changes, it can be hot-reloaded without restarting the service:
curl -XPOST http://127.0.0.1:7181/-/reload
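Whether the reload succeeded can be verified through Prometheus's own self-monitoring metrics (a quick check; the metric name is standard, the port follows the setup above):
curl -s http://127.0.0.1:7181/metrics | grep prometheus_config_last_reload_successful
A value of 1 means the new configuration was applied; on a failed reload Prometheus keeps the previous configuration and the metric drops to 0.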
Checking the Prometheus target status afterwards shows that scraping has returned to normal.