1. Theory
Yesterday my classmate neil tossed me a small problem: use hadoop to extract the common entries from two files. The files look like this:
input1.txt
aaaa
bbbb
cc
11
input2.txt
aaa
bbbb
ccc
22
In other words, compute the intersection of the files, printing each duplicate entry only once. If you are reasonably familiar with java and hadoop there are quite a few ways to solve this; I'm not fluent in either, so here I only offer one approach.
First, merge the files while they are still on the local machine (《Hadoop in Action》 has a merge example), then upload the merged result straight to HDFS. Since we want the intersection of the two files, we have to add some tags during the merge so the entries can be told apart later when counting. The tagging method is quick and dirty: simply append our tag as each line is written to HDFS. Note that we process the data line by line, but FSDataInputStream, the file stream hadoop gives us, does not support reading by line. So come walk up its parent classes with me until we dig out an API that can read lines. Searching:
FSDataInputStream -> DataInputStream: we instantly see that its parent class DataInputStream has a readLine() method, but it is marked as Deprecated. The documentation explains:
readLine()
Deprecated. This method does not properly convert bytes to characters. As of JDK 1.1, the preferred way to read lines of text is via the BufferedReader.readLine() method. Programs that use the DataInputStream class to read lines can be converted to use the BufferedReader class by replacing code of the form:
    DataInputStream d = new DataInputStream(in);
with:
    BufferedReader d = new BufferedReader(new InputStreamReader(in));
So we turn to the BufferedReader class. It needs an InputStreamReader object to construct itself, and InputStreamReader in turn needs an InputStream, so what we really need is an InputStream. Looking back at DataInputStream: it ultimately extends InputStream, and since in java a parent-class reference can point to a child-class object, we can pass a subclass of InputStream anywhere an InputStream is expected. That gives us line-by-line reading. Every time we read a line we append our tag, marking the i-th file as i; the result looks like this (a sketch of this merge-and-tag step follows the example output below):
input1.txt
aaaa 0
bbbb 0
cc 0
11 0
input2.txt
aaa 1
bbbb 1
ccc 1
22 1
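Below is a minimal sketch of this merge-and-tag step using Hadoop's FileSystem API. The class name MergeAndTag, the local input paths, the HDFS output path /tmp/merged.txt and the tab separator are all placeholders of mine, not from the original problem. It opens each file as an FSDataInputStream, hands it to InputStreamReader/BufferedReader to get line-oriented reading, and appends the file index as the tag while writing to HDFS:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeAndTag {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);   // the files to merge are still local
        FileSystem hdfs = FileSystem.get(conf);         // destination file system (HDFS)

        // placeholder paths for illustration
        Path[] inputs = { new Path("input1.txt"), new Path("input2.txt") };
        Path merged = new Path("/tmp/merged.txt");

        FSDataOutputStream out = hdfs.create(merged);
        for (int i = 0; i < inputs.length; i++) {
            // FSDataInputStream is (indirectly) an InputStream, so it can be
            // wrapped in InputStreamReader/BufferedReader to get readLine()
            FSDataInputStream in = local.open(inputs[i]);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                // tag every line with the index of the file it came from
                out.write((line + "\t" + i + "\n").getBytes("UTF-8"));
            }
            reader.close();
        }
        out.close();
    }
}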
Then run it on hadoop and process by key: if the values of a given key contain both 0 and 1, keep that key in the output file.
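Here is a minimal sketch of that counting step as a MapReduce job, written against the newer org.apache.hadoop.mapreduce API; the class names and the tab separator are assumptions matching the merge sketch above, not the original code. The mapper splits each tagged line back into (entry, tag), and the reducer emits an entry only when both tags 0 and 1 appear among its values:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Intersection {

    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // each merged line is "entry<TAB>tag"; re-emit it as (entry, tag)
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    public static class IntersectReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // keep the entry only if it was tagged by both files
            Set<String> tags = new HashSet<String>();
            for (Text v : values) {
                tags.add(v.toString());
            }
            if (tags.contains("0") && tags.contains("1")) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "file intersection");
        job.setJarByClass(Intersection.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(IntersectReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the merged, tagged file
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // the intersection result
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}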
2. Implementation