[Original] Loading/Parsing/Reading XML Data with Apache Pig – 编码无悔 / Intent & Focused


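Loading/parsing/reading XML data with Apache Pig is not a common requirement, but sometimes it is simply the most convenient option. Fortunately, piggybank, Pig's library of UDFs, already does the work for us: its XMLLoader makes parsing XML files straightforward.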

Suppose we have the following XML file:

<?xml version="1.0" encoding="UTF-8"?>

<users>
    <user>
        <profileUrl>http://www.codelast.com/user/profile/159.html</profileUrl>
        <data>
            <id>159</id>
            <name>李四</name>
        </data>
    </user>
    <user>
        <profileUrl>http://www.codelast.com/user/profile/220.html</profileUrl>
        <data>
            <id>220</id>
            <name>王五</name>
        </data>
    </user>

</users>

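Let's first take a look at this XML file. It stores information about multiple users: the <users></users> tag wraps all of the user records, and each <user></user> tag corresponds to one user. Within each user record there is a first-level element, profileUrl, as well as more deeply nested elements, data/id and data/name. The code below extracts all of these fields.
Although this XML is structurally simple, it has clearly separated levels of nesting, so the approach used to parse it carries over to more complex XML files.
One point deserves special mention: big data stored on disk is usually not "pretty-printed" like this (I formatted the XML above only to make it easier to read) but kept in a compact form, which might look like this: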

<?xml version="1.0" encoding="UTF-8"?><users><user><profileUrl>http://www.codelast.com/user/profile/159.html</profileUrl><data><id>159</id><name>李四</name></data></user><user><profileUrl>http://www.codelast.com/user/profile/220.html</profileUrl><data><id>220</id><name>王五</name></data></user></users>

For piggybank's XMLLoader, however, these differences in formatting do not affect how the XML content is parsed, which is nice.
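To illustrate why layout does not matter to an XML parser in general, here is a minimal Python sketch. It is not piggybank code and says nothing about how XMLLoader is implemented; it only uses the standard library's xml.etree.ElementTree on a one-user excerpt of the sample data. The names PRETTY, COMPACT and extract_fields are made up for this example, and the XML declaration is left out to keep the strings short.

```python
import xml.etree.ElementTree as ET

# Pretty-printed form (one-user excerpt of the sample above).
PRETTY = """<users>
    <user>
        <profileUrl>http://www.codelast.com/user/profile/159.html</profileUrl>
        <data>
            <id>159</id>
            <name>李四</name>
        </data>
    </user>
</users>"""

# The same content in compact, single-line form.
COMPACT = (
    "<users><user>"
    "<profileUrl>http://www.codelast.com/user/profile/159.html</profileUrl>"
    "<data><id>159</id><name>李四</name></data>"
    "</user></users>"
)


def extract_fields(xml_text):
    """Return one (profileUrl, id, name) tuple per <user> element."""
    root = ET.fromstring(xml_text)
    return [
        (u.findtext("profileUrl"), u.findtext("data/id"), u.findtext("data/name"))
        for u in root.iter("user")
    ]


# Indentation and line breaks between elements do not change the result.
print(extract_fields(PRETTY) == extract_fields(COMPACT))
```

The whitespace that differs between the two forms sits between elements, not inside the leaf elements whose text is extracted, so both inputs produce the same tuples.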
Article source: https://www.codelast.com/
Without further ado, here is the code:

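-- Register the piggybank jar; adjust this path to match your environment
REGISTER /xxx/piggybank.jar;

-- Short alias for piggybank's XPath UDF
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- Each <user>...</user> element is loaded as one chararray record
A = LOAD 'users.xml' USING org.apache.pig.piggybank.storage.XMLLoader('user') AS (x: chararray);
-- Extract the three fields from each record by their XPath
B = FOREACH A GENERATE
        XPath(x, 'user/profileUrl') AS profileUrl,
        XPath(x, 'user/data/id') AS id,
        XPath(x, 'user/data/name') AS name;
DUMP B;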

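Code walkthrough:
➤ The XMLLoader argument "user" should be the name of the tag that occurs multiple times in your XML file, i.e. the tag that delimits one record. The XML above contains several user elements but only one users element, so the value to pass here is user.
➤ The "x" in AS (x: chararray) can be any name. It is only used afterwards to refer to the record when extracting its fields, so the name itself does not matter.
➤ In XPath(), a field is selected by its path through the XML hierarchy, starting from the record's own tag (user), which is fairly self-explanatory.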
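The three points above can be mimicked outside Pig with a small Python sketch. This is only an analogy under stated assumptions, not piggybank's implementation: load_records and xpath are hypothetical helpers written for this example, the regular expression assumes the record tag has no attributes and is not nested inside itself, and the path handling covers only simple slash-separated paths like the ones used in the script.

```python
import re
import xml.etree.ElementTree as ET

XML = (
    "<users>"
    "<user><profileUrl>http://www.codelast.com/user/profile/159.html</profileUrl>"
    "<data><id>159</id><name>李四</name></data></user>"
    "<user><profileUrl>http://www.codelast.com/user/profile/220.html</profileUrl>"
    "<data><id>220</id><name>王五</name></data></user>"
    "</users>"
)


def load_records(xml_text, tag):
    """Hypothetical stand-in for XMLLoader(tag): one string per <tag>...</tag> block.

    Simplification: assumes the tag carries no attributes and is not nested in itself.
    """
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(xml_text)


def xpath(record, path):
    """Hypothetical stand-in for XPath(x, path), for simple a/b/c paths only.

    The first path step must name the record's root tag, which is why the
    paths in the Pig script start with 'user'.
    """
    root = ET.fromstring(record)
    first, _, rest = path.partition("/")
    if first != root.tag:
        return None
    return root.findtext(rest) if rest else root.text


# The variable name x is arbitrary, just like the alias in AS (x: chararray).
rows = [
    (xpath(x, "user/profileUrl"), xpath(x, "user/data/id"), xpath(x, "user/data/name"))
    for x in load_records(XML, "user")
]
for row in rows:
    print(row)
```

Passing "users" instead of "user" to load_records would yield a single record covering the whole document, which is why the repeated tag is the one to choose.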
The code above outputs:

(http://www.codelast.com/user/profile/159.html,159,李四)
(http://www.codelast.com/user/profile/220.html,220,王五)

These are the three fields we extracted.

➤➤ Copyright notice ➤➤
Reposts must credit the source: codelast.com