I am trying to create a Hive table from a JSON file using org.apache.hive.hcatalog.data.JsonSerDe.
First, I load the file from the local file system to HDFS. Here is the code in Hive:
CREATE EXTERNAL TABLE tweet8(
user struct<userlocation:string, id:string, name:string>,
tweetmessage string,
createddate string,
geolocation string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/tmp/hive/hello';
Hive duplicates every record in my file except the last one. For example, my text file contains 4 JSON objects A, B, C, D; after loading it into Hive, I get A, A, B, B, C, C, D.
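A sketch of a query that could show where the extra rows come from, assuming the tweet8 table defined above. INPUT__FILE__NAME is a Hive virtual column holding the path of the file each row was read from; the aliases source_file and row_count are only illustrative names.

```sql
-- Count the rows Hive reads from each file under the table location.
-- More than one file listed here would mean the table directory holds
-- more than the single file I uploaded.
SELECT INPUT__FILE__NAME AS source_file,
       COUNT(*)          AS row_count
FROM tweet8
GROUP BY INPUT__FILE__NAME;
```

For the four-object example, a single file with a count of 7 instead of 4 would suggest the extra rows are produced while reading that one file rather than by extra files in /tmp/hive/hello.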
As I understand it, when a file is loaded from local to HDFS, Hadoop creates a number of replicas, and I suspect these replicas are what cause the duplication in the Hive table. I see two possible solutions to the problem:
1 - set the replication factor to 1 when uploading the file from local to HDFS;
2 - after creating the table, run a SELECT DISTINCT query on the tweet8 table to create a new table without duplicates.
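For reference, option 2 could look roughly like the sketch below. The table name tweet8_dedup and the flattened column aliases are hypothetical; I select the struct fields one by one because I am not sure DISTINCT works on a struct column in every Hive version.

```sql
-- Build a managed table containing each distinct row of tweet8 once.
-- `user` is quoted with backticks because it is a reserved word in
-- newer Hive releases.
CREATE TABLE tweet8_dedup AS
SELECT DISTINCT
  `user`.userlocation AS userlocation,
  `user`.id           AS user_id,
  `user`.name         AS user_name,
  tweetmessage,
  createddate,
  geolocation
FROM tweet8;
```

Note that this would also collapse any records that are legitimately identical in the source file, not only the ones duplicated on load.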
Which of these is the better practice?
Thanks for any suggestions! (Feel free to ask if my question needs clarification, and sorry for my bad English.)