由编码识别遇到有关问题，思考utf8编码正则表达式（php版本）

由编码识别遇到问题，思考utf8编码正则表达式（php版本）

起因：

通过上面6个维度得到得到对应的正则表达式：

[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5}

以上分别是各个维度范围

<span style="line-height: 1.5;color: #0000ff"><?</span>php<span style="line-height: 1.5;color: #008000">//当前编码是gbk</span>$str="<span style="line-height: 1.5;color: #8b0000">袁</span>";echo urlencode($str);echo is_utf8($str);function is_utf8($str){	<span style="line-height: 1.5;color: #008000">///utf8编码正则检测函数</span>	<span style="line-height: 1.5;color: #008000">///copyright qq:8292669  http://www.cnblogs.com/chengmo</span>	$re='<span style="line-height: 1.5;color: #8b0000">/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8<a>本2文来*源gao($daima.com搞@代@#码(网</a><strong>搞gaodaima代码</strong>-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/</span>';	return preg_match($re,$str);}

<strong><span style="line-height: 1.5;color: #ff0000">上面执行结果返回为1，然后”袁“本身应该是gbk编码。看来上面函数还是不能彻底检查utf8编码。分析原因，从上面正则可以看到，utf8的6个维度对应字节长度从1-6字节。 而gbk是1-2个字节。因此他们之间会在1-2个字节长度地方检查出现重合。1个字节的时候gbk与utf8的 编码与字符对应关系都一样，但是2个字节时候，对应编码与字符各不相同。</span></strong>

通过查询gbk编码表：http://www.knowsky.com/resource/gb2312tbl.htm 进一步确认，范围会在：

[c0-df][a0-bf]  之内汉字都会有问题了。 <strong>如果纯这个范围的汉字组合为字符串就会出现判断不了情况。如果它与其它范围字符组合都可以正确的判断出来。</strong>

<strong></strong>?

GBK与UTF8字符集重叠对应的字符是：（gbk编码表）

搞代码网（gaodaima.com）提供的所有资源部分来自互联网，如果有侵犯您的版权或其他权益，请说明详细缘由并提供版权或权益证明然后发送到邮箱[email protected]‍，我们会在看到邮件的第一时间内为您处理，或直接联系QQ：872152909。本网站采用BY-NC-SA协议进行授权
转载请注明原文链接：由编码识别遇到有关问题，思考utf8编码正则表达式（php版本）

GBK与UTF8字符集重叠对应的字符是：（gbk编码表）

Hi，您需要填写昵称和邮箱！