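[Kafka] Troubleshooting: data cannot be consumed because some __consumer_offsets partitions are abnormal (by 皮皮熊, via 瞎采新闻)

Notes on troubleshooting a Kafka consumption failure. https://github.com/pierre94/kafka-notes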

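I. Problem description

Some consumer groups could not consume data through the brokers (new consumer). After the consumer group was renamed, consumption returned to normal.

Group names (the real names may reveal business information, so the names shown here are placeholders):

group1-(redacted)
group2-(redacted)

Kafka version: 0.9.0.1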

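II. Initial analysis

1. Describe the affected consumer groups

Describing the affected consumer groups throws the following exception:

    Error while executing consumer group command The group coordinator is not available.
    org.apache.kafka.common.errors.GroupCoordinatorNotAvailableException: The group coordinator is not available.

2. Searching for known issues

Others have reported similar problems, but none of the reports explains why the problem occurs or how to fix it for good (restarting does not count).

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Problem-with-Kafka-0-9-Client-td4975.html

III. In-depth analysis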

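Logs are a programmer's first-hand material for analysis. The Kafka brokers carry a large amount of production traffic, so enabling debug logging on them is not appropriate; we can only start from the client side.

1. Enable client debug logging

With the client log level set to debug, the following entries repeat in a continuous loop: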

    19:52:41.785 TKD [main] DEBUG o.a.k.c.c.i.AbstractCoordinator - Issuing group metadata request to broker 43
    19:52:41.788 TKD [main] DEBUG o.a.k.c.c.i.AbstractCoordinator - Group metadata response ClientResponse(receivedTimeMs=1587642761788, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@1b68ddbd, request=RequestSend(header={api_key=10,api_version=0,correlation_id=30,client_id=consumer-1}, body={group_id=30cab231-05ed-43ef-96aa-a3ca1564baa3}), createdTimeMs=1587642761785, sendTimeMs=1587642761785), responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}})
    19:52:41.875 TKD [main] DEBUG o.apache.kafka.clients.NetworkClient - Sending metadata request ClientRequest(expectResponse=true, callback=null, request=RequestSend(header={api_key=3,api_version=0,correlation_id=31,client_id=consumer-1}, body={topics=[topic(redacted)]}), isInitiatedByNetworkClient, createdTimeMs=1587642761875, sendTimeMs=0) to node 43

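From this we can roughly see that the loop keeps doing the following (the order may not be exact):

- issuing a group metadata request to some broker
- receiving the group metadata response
- sending a metadata request

We focus on the error in the group metadata response, responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}}, which tells us that the Kafka broker did not return any node information for the coordinator.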
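When scanning a long debug log for this pattern, it can help to pull the error code and coordinator fields out mechanically. The following is only a minimal sketch: CoordinatorLogParser and CoordinatorResponse are hypothetical names introduced here (they are not part of Kafka), and the regular expression assumes the responseBody layout shown in the log excerpt above.

```scala
// Hypothetical helper, not part of Kafka: extracts error_code and the
// coordinator fields from a client debug line such as the one shown above.
object CoordinatorLogParser {
  // Assumes the layout "responseBody={error_code=N,coordinator={node_id=N,host=H,port=N}}"
  private val Body =
    """responseBody=\{error_code=(-?\d+),coordinator=\{node_id=(-?\d+),host=([^,]*),port=(-?\d+)\}\}""".r

  final case class CoordinatorResponse(errorCode: Int, nodeId: Int, host: String, port: Int) {
    // A usable coordinator needs error code 0 and a real (non-negative) node id
    def hasCoordinator: Boolean = errorCode == 0 && nodeId >= 0
  }

  // Returns None when the line does not contain a coordinator response body
  def parse(line: String): Option[CoordinatorResponse] =
    Body.findFirstMatchIn(line).map { m =>
      CoordinatorResponse(m.group(1).toInt, m.group(2).toInt, m.group(3), m.group(4).toInt)
    }
}
```

For the response logged above, parse would yield a CoordinatorResponse with errorCode 15, nodeId -1, an empty host, and port -1, so hasCoordinator is false.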

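2. How the broker responds to the request

Entry function for the request

First we need to look at the broker source code that handles requests with api_key=10:

The corresponding API handler is found in kafka.server.KafkaApis: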

    def handle(request: RequestChannel.Request) {
      ……
      case RequestKeys.GroupCoordinatorKey => handleGroupCoordinatorRequest(request)
      ……
    }

Logic of handleGroupCoordinatorRequest

    def handleGroupCoordinatorRequest(request: RequestChannel.Request) {
      val groupCoordinatorRequest = request.body.asInstanceOf[GroupCoordinatorRequest]
      val responseHeader = new ResponseHeader(request.header.correlationId)

      if (!authorize(request.session, Describe, new Resource(Group, groupCoordinatorRequest.groupId))) {
        val responseBody = new GroupCoordinatorResponse(Errors.GROUP_AUTHORIZATION_FAILED.code, Node.noNode)
        requestChannel.sendResponse(new RequestChannel.Response(request, new ResponseSend(request.connectionId, responseHeader, responseBody)))
      } else {
        val partition = coordinator.partitionFor(groupCoordinatorRequest.groupId)

        // get metadata (and create the topic if necessary)
        val offsetsTopicMetadata = getOrCreateGroupMetadataTopic(request.securityProtocol)

        // First possible failure point: offsetsTopicMetadata.errorCode is not NONE
        val responseBody = if (offsetsTopicMetadata.errorCode != Errors.NONE.code) {
          new GroupCoordinatorResponse(Errors.GROUP_COORDINATOR_NOT_AVAILABLE.code, Node.noNode)
        } else {
          val coordinatorEndpoint = offsetsTopicMetadata.partitionsMetadata
            .find(_.partitionId == partition)
            .flatMap { partitionMetadata => partitionMetadata.leader }

          // Second possible failure point: coordinatorEndpoint is empty (None)
          coordinatorEndpoint match {
            case Some(endpoint) =>
              new GroupCoordinatorResponse(Errors.NONE.code, new Node(endpoint.id, endpoint.host, endpoint.port))
            case _ =>
              new GroupCoordinatorResponse(Errors.GROUP_COORDINATOR_NOT_AVAILABLE.code, Node.noNode)
          }
        }

        trace("Sending consumer metadata %s for correlation id %d to client %s."
          .format(responseBody, request.header.correlationId, request.header.clientId))
        requestChannel.sendResponse(new RequestChannel.Response(request, new ResponseSend(request.connectionId, responseHeader, responseBody)))
      }
    }

Here error_code=15 corresponds to Errors.GROUP_COORDINATOR_NOT_AVAILABLE.code.

From the source, there are two places that can produce Errors.GROUP_COORDINATOR_NOT_AVAILABLE.code:

Suspect 1: offsetsTopicMetadata carries an error code

    offsetsTopicMetadata.errorCode != Errors.NONE.code

An error code other than NONE on offsetsTopicMetadata means that fetching the metadata of the whole __consumer_offsets topic failed. In our case only some groups were affected, so this is unlikely to be the cause.

Suspect 2: coordinatorEndpoint is empty

    val coordinatorEndpoint = offsetsTopicMetadata.partitionsMetadata
      .find(_.partitionId == partition)
      .flatMap { partitionMetadata => partitionMetadata.leader }

From the metadata in offsetsTopicMetadata, the code picks out the leader of the partition given by coordinator.partitionFor(groupCoordinatorRequest.groupId). That value depends directly on the group name, so this is very likely where things go wrong.

Logic of partitionFor:

    def partitionFor(groupId: String): Int = Utils.abs(groupId.hashCode) % groupMetadataTopicPartitionCount

That is, the non-negative hashCode of the group name modulo groupMetadataTopicPartitionCount (the number of partitions of __consumer_offsets).

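Note: because the real names may reveal business information, the group names shown are placeholders. The results, however, were computed from the real group names.

    scala> "group1-(redacted)".hashCode % 50
    res2: Int = 43
    scala> "group2-(redacted)".hashCode % 50
    res3: Int = 43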
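The REPL check above applies % directly to hashCode, which gives the right answer only when the hash is non-negative; in Scala, -7 % 50 is -7. A standalone sketch that also applies the non-negative step is shown below. OffsetsPartition and nonNegative are hypothetical names, nonNegative is an assumption about how Utils.abs behaves (Int.MinValue mapped to 0, otherwise the absolute value), and the default of 50 partitions is taken from the modulo used in the REPL session.

```scala
// Hypothetical standalone re-implementation for experimenting in a REPL;
// it does not call into Kafka.
object OffsetsPartition {
  // Assumption: mirrors the Utils.abs step in partitionFor, so the result
  // is never negative (math.abs(Int.MinValue) alone would stay negative).
  def nonNegative(n: Int): Int = if (n == Int.MinValue) 0 else math.abs(n)

  // partitionCount: number of __consumer_offsets partitions (50 assumed here)
  def partitionFor(groupId: String, partitionCount: Int = 50): Int =
    nonNegative(groupId.hashCode) % partitionCount
}
```

Calling OffsetsPartition.partitionFor with each real group name should then reproduce the partition number that the broker computes, provided the partition count matches the cluster's setting.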

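Both failing consumer groups map to 43 under partitionFor, so our preliminary conclusion is that the problem is related to partition 43 of __consumer_offsets. Next we look at the logic behind offsetsTopicMetadata to confirm this.

Logic of offsetsTopicMetadata

    val offsetsTopicMetadata = getOrCreateGroupMetadataTopic(request.securityProtocol)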

getOrCreateGroupMetadataTopic -> metadataCache.getTopicMetadata -> getPartitionMetadata

    private def getPartitionMetadata(topic: String, protocol: SecurityProtocol): Option[Iterable[PartitionMetadata]] = {
      cache.get(topic).map { partitions =>
        partitions.map { case (partitionId, partitionState) =>
          val topicPartition = TopicAndPartition(topic, partitionId)

          val leaderAndIsr = partitionState.leaderIsrAndControllerEpoch.leaderAndIsr
          val maybeLeader = aliveBrokers.get(leaderAndIsr.leader)

          val replicas = partitionState.allReplicas
          val replicaInfo = getAliveEndpoints(replicas, protocol)

          maybeLeader match {
            case None =>
              debug("Error while fetching metadata for %s: leader not available".format(topicPartition))
              new PartitionMetadata(partitionId, None, replicaInfo, Seq.empty[BrokerEndPoint], Errors.LEADER_NOT_AVAILABLE.code)

            case Some(leader) =>
              val isr = leaderAndIsr.isr
              val isrInfo = getAliveEndpoints(isr, protocol)

              if (replicaInfo.size < replicas.size) {
                debug("Error while fetching metadata for %s: replica information not available for following brokers %s"
                ……
    }

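offsetsTopicMetadata holds the per-partition metadata of the topic, built from each partition's leader, replicaInfo, and isr. A partition whose leader is not among the alive brokers comes back with its leader set to None (the LEADER_NOT_AVAILABLE branch). We therefore judge that the leader, replicaInfo, or isr of __consumer_offsets partition 43 is probably abnormal, so the lookup that starts with find(_.partitionId == partition) ends up with no leader for the partition selected by the hashCode modulo.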
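To make the failure path concrete, the lookup can be modeled with simplified types. This is a sketch only: Endpoint, PartitionMeta, and findCoordinator are hypothetical stand-ins for Kafka's BrokerEndPoint, PartitionMetadata, and the handler logic, and the broker host names in the usage note are invented.

```scala
// Simplified model of the coordinator lookup; types are hypothetical
// stand-ins, not Kafka's own classes.
object CoordinatorLookup {
  final case class Endpoint(id: Int, host: String, port: Int)
  final case class PartitionMeta(partitionId: Int, leader: Option[Endpoint])

  // Error code observed in the client log for GROUP_COORDINATOR_NOT_AVAILABLE
  val GroupCoordinatorNotAvailable = 15

  // Right(endpoint) when the partition exists and has a leader;
  // Left(15) when the partition is missing or its leader is None.
  def findCoordinator(partitions: Seq[PartitionMeta], partition: Int): Either[Int, Endpoint] =
    partitions
      .find(_.partitionId == partition)
      .flatMap(_.leader)
      .toRight(GroupCoordinatorNotAvailable)
}
```

With a metadata list such as Seq(PartitionMeta(42, Some(Endpoint(7, "broker-7", 9092))), PartitionMeta(43, None)), a lookup for partition 42 returns the endpoint, while a lookup for partition 43 returns Left(15). This matches the symptom that only groups hashing to one partition fail.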

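IV. Back to production

1. Verifying the __consumer_offsets partition information

Partition 43 did indeed have an abnormal leader.

2. Reproducing the problem

We used UUIDs to generate consumer group names in bulk and kept those whose hashCode modulo equals the abnormal partition number. Every one of these group names hit the same consumption failure when used.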
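The reproduction step can be sketched as follows. GroupNameGenerator and groupNamesFor are hypothetical names; the partition rule is re-implemented locally on the assumption that it matches the partitionFor logic quoted earlier, and the target partition and partition count are passed in rather than read from the cluster.

```scala
import java.util.UUID

// Hypothetical generator for test group names that land on a chosen
// __consumer_offsets partition; it does not talk to Kafka.
object GroupNameGenerator {
  // Assumed to match the broker rule: non-negative hashCode modulo partition count
  def partitionFor(groupId: String, partitionCount: Int): Int = {
    val h = groupId.hashCode
    (if (h == Int.MinValue) 0 else math.abs(h)) % partitionCount
  }

  // Keeps drawing random UUID strings until howMany of them map to targetPartition
  def groupNamesFor(targetPartition: Int, partitionCount: Int, howMany: Int): List[String] =
    Iterator.continually(UUID.randomUUID().toString)
      .filter(partitionFor(_, partitionCount) == targetPartition)
      .take(howMany)
      .toList
}
```

For example, GroupNameGenerator.groupNamesFor(43, 50, 5) returns five random names that all map to partition 43. On average about one UUID in fifty qualifies, so the search finishes quickly.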

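3. Further questions

Why do some __consumer_offsets partitions end up with abnormal leader, replicaInfo, or isr state?
This may be related to network jitter and certain cluster operations; each case needs its own analysis.

How can an abnormal __consumer_offsets partition be restored?
This is not covered in detail here; see http://blog.itpub.net/31543630/viewspace-2212467/ .

V. References

Kafka new-consumer design document https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Detailed+Consumer+Coordinator+Design
Kafka无法消费?!我的分布式消息服务Kafka却稳如泰山！ (in English: "Kafka cannot consume?! My distributed message service Kafka stays rock solid!") http://blog.itpub.net/31543630/viewspace-2212467/
Problem with Kafka 0.9 Client http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Problem-with-Kafka-0-9-Client-td4975.html
ErrorMapping https://github.com/apache/kafka/blob/0.9.0/core/src/main/scala/kafka/common/ErrorMapping.scala

Originally published in the Tencent Cloud community (腾讯云社区), by 皮皮熊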
