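Fixing garbled output when splitting GBK Chinese text
Recently I ran into a curious character: “弢” (tao).
Here is exactly what happened, in a script saved as GBK: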
    $list = explode('|', 'abc弢|bc');
    var_dump($list);
This should simply return the pieces of the split.
Contrary to expectation, the result was:
    array(3) {
      [0]=> string(4) "abc?"
      [1]=> string(0) ""
      [2]=> string(2) "bc"
    }
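The output is garbled, and an empty element has appeared out of nowhere.
The cause: the GBK encoding of “弢” is 0x8F 0x7C, and the ASCII code of | is 0x7C. explode compares raw bytes, so it treats the second byte of “弢” as a | and splits there.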
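The byte collision can be reproduced without depending on how the script file itself is encoded. The sketch below is not part of the original code; it assumes, as stated above, that the GBK bytes of “弢” are 0x8F 0x7C, writes them as escapes, and prints each piece as hex so the stray lead byte is visible:

```php
<?php
// Build the GBK bytes explicitly (assumption: 0x8F 0x7C is GBK for "弢").
$gbk = "abc\x8f\x7c|bc";

// explode() works on bytes, so it sees two 0x7C bytes in a row.
$parts = explode('|', $gbk);

// Dump every piece as hex; the first piece ends with the orphaned 0x8F.
foreach ($parts as $i => $part) {
    printf("[%d] %s\n", $i, bin2hex($part));
}
// [0] 6162638f
// [1]
// [2] 6263
```

The first element keeps the orphaned lead byte 0x8F, which is the “?” in the dump above, and the empty element is the gap between the two 0x7C bytes.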
Since this is a double-byte problem, mbstring seemed like the right tool.
Unfortunately PHP has no mb_explode function. After some searching I found mb_split:
    array mb_split ( string $pattern , string $string [, int $limit = -1 ] )
It has no parameter for the encoding. On closer inspection, the encoding is set through mb_regex_encoding.
So I wrote the following code:
    mb_regex_encoding('gbk');
    $list = mb_split('\|', 'abc弢|bc');
    var_dump($list);
PHP raised an error: mb_regex_encoding does not recognize gbk. Awkward.
So let's use an encoding it does recognize:
    mb_regex_encoding('gb2312');
    $list = mb_split('\|', 'abc弢|bc');
    var_dump($list);
Result:
    array(3) {
      [0]=> string(4) "abc?"
      [1]=> string(0) ""
      [2]=> string(2) "bc"
    }
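So this approach does not help either.
Why? It turns out that “弢” is not in the GB2312 character set at all, while the character sets that do contain it (GBK, GB18030) were not accepted by this function.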
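One way to check this claim is mb_check_encoding, which reports whether a byte string is valid in a given encoding. This is a small sketch rather than part of the original investigation. It assumes mbstring's names EUC-CN for GB2312 and CP936 for GBK, and the byte values 0x8F 0x7C quoted earlier:

```php
<?php
// GBK bytes for "弢" as quoted in the text (assumption carried over).
$bytes = "\x8f\x7c";

// EUC-CN is mbstring's name for the GB2312 encoding.
$inGb2312 = mb_check_encoding($bytes, 'EUC-CN');

// CP936 is mbstring's name for GBK.
$inGbk = mb_check_encoding($bytes, 'CP936');

var_dump($inGb2312, $inGbk);
```

If the explanation above holds, the first value should be false and the second true, because 0x8F is not a lead byte that GB2312 uses.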
Since that did not work, maybe the almighty regular expression would. That led to this code:
    var_dump(preg_match_all('/([^\|])*/', 'abc弢|bc', $matches));
    var_dump($matches);
Result:
    int(2)
    array(2) {
      [0]=> array(2) {
        [0]=> string(4) "abc?"
        [1]=> string(2) "bc"
      }
      [1]=> array(2) {
        [0]=> string(1) "?"
        [1]=> string(1) "c"
      }
    }
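Well, I was hoping for too much.
Next I looked into how to describe this situation with a regular expression.
For reference, see 鸟哥's blog post “分割GBK中文遭遇乱码的解决” (on solving garbled text when splitting GBK Chinese). Unfortunately my regex skills are weak, and I still could not come up with a suitable pattern (if anyone works one out, please let me know).
With no better option, after some thought I fell back on a substr-based approach: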
    function mb_explode($delimiter, $string, $encoding = null) {
        // Guard: an empty delimiter would never consume any input.
        if ($delimiter === '') {
            return false;
        }
        $list = array();
        // Fall back to mbstring's internal encoding when none is given.
        is_null($encoding) && $encoding = mb_internal_encoding();
        // Delimiter length in characters, not bytes.
        $len = mb_strlen($delimiter, $encoding);
        while (false !== ($idx = mb_strpos($string, $delimiter, 0, $encoding))) {
            // Keep the text before the delimiter, then drop it and the delimiter.
            $list[] = mb_substr($string, 0, $idx, $encoding);
            $string = mb_substr($string, $idx + $len, null, $encoding);
        }
        // Whatever remains is the last element.
        $list[] = $string;
        return $list;
    }
Test code:
    $a = 'abc弢|bc';

    var_dump(mb_explode('|', $a, 'gbk'));
    var_dump(mb_explode('bc', $a, 'gbk'));
    var_dump(mb_explode('弢', $a, 'gbk'));
Result:
    array(2) {
      [0]=> string(5) "abc弢"
      [1]=> string(2) "bc"
    }
    array(3) {
      [0]=> string(1) "a"
      [1]=> string(3) "弢|"
      [2]=> string(0) ""
    }
    array(2) {
      [0]=> string(3) "abc"
      [1]=> string(3) "|bc"
    }
This produces the correct results.
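Two other routes may be worth comparing with the function above. Both are sketches under stated assumptions, not the author's method, and the helper names gbk_explode_via_utf8 and gbk_fields_by_regex are made up for illustration. The first converts to UTF-8 before calling explode; this works because every byte of a multi-byte UTF-8 character is 0x80 or higher, so an ASCII delimiter cannot match inside one. The second is one possible byte-level pattern for the regex question raised earlier. It assumes GBK lead bytes 0x81-0xFE and trail bytes 0x40-0xFE, handles only the | delimiter, and, unlike explode, drops empty fields.

```php
<?php
// Hypothetical helper: split by round-tripping through UTF-8.
function gbk_explode_via_utf8($delimiter, $string, $encoding = 'GBK') {
    $utf8Parts = explode(
        mb_convert_encoding($delimiter, 'UTF-8', $encoding),
        mb_convert_encoding($string, 'UTF-8', $encoding)
    );
    $parts = array();
    foreach ($utf8Parts as $part) {
        // Convert each piece back to the original encoding.
        $parts[] = mb_convert_encoding($part, $encoding, 'UTF-8');
    }
    return $parts;
}

// Hypothetical helper: collect the non-empty fields between '|' bytes.
// A field is a run of either a two-byte GBK character (assumed ranges)
// or a single byte that is neither '|' nor a GBK lead byte.
function gbk_fields_by_regex($string) {
    preg_match_all('/(?:[\x81-\xFE][\x40-\xFE]|[^|\x81-\xFE])+/', $string, $m);
    return $m[0];
}

// GBK bytes written as escapes, so the file encoding does not matter.
$gbk = "abc\x8f\x7c|bc";

foreach (gbk_explode_via_utf8('|', $gbk) as $part) {
    echo bin2hex($part), "\n";
}
foreach (gbk_fields_by_regex($gbk) as $part) {
    echo bin2hex($part), "\n";
}
```

The UTF-8 round trip costs two conversions per call but reuses explode unchanged. The regex stays in GBK but needs a different pattern for each delimiter.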