Join tables with different columns using Hive sql
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I'm using AWS cluster to process many JSON files with nested data into Hive flattened tables.
Some data sets are very big, so I need to split it into chunks and make separate small Hive table for each chunk, which leads this smaller tables to have different columns.
For example, table1
has columns A1....A1000
, table2
has columns A1,A3,A100,...A1000
, and A1001
(so some columns are common, some are unique for each table).
Is there a way to append, union or join those small tables (could be hundreds of tables, with ~1e7 rows each one) into one big Hive table, which has all the columns from small tables (and NULLs in missing cells for each small table that didn't have that specific column originally)?
Like Total_table
with columns A1...A1001
?
UPD: printSchema() command would show something like this:
root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)
When I flatten this to get a regular Hive table, depending on the set of json files that happened to be in each chunk, the number of elements in ids array (for example) can be different, and so output columns will be different too: like aaa_ids_bbb1_0 (for 0st element of the ids array),...,aaa_ids_bbb1_999 (for 999th element, if there happen to be 1000 elements). So tables in each chunk can be different.
In python pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D(2nd), and D,E,F(3rd), the resulting dataframe will have columns A,B,C,D,E,F. I need to do the same thing but in Hive SQL, if possible of course.
sql amazon-web-services join hive union
add a comment |
I'm using AWS cluster to process many JSON files with nested data into Hive flattened tables.
Some data sets are very big, so I need to split it into chunks and make separate small Hive table for each chunk, which leads this smaller tables to have different columns.
For example, table1
has columns A1....A1000
, table2
has columns A1,A3,A100,...A1000
, and A1001
(so some columns are common, some are unique for each table).
Is there a way to append, union or join those small tables (could be hundreds of tables, with ~1e7 rows each one) into one big Hive table, which has all the columns from small tables (and NULLs in missing cells for each small table that didn't have that specific column originally)?
Like Total_table
with columns A1...A1001
?
UPD: printSchema() command would show something like this:
root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)
When I flatten this to get a regular Hive table, depending on the set of json files that happened to be in each chunk, the number of elements in ids array (for example) can be different, and so output columns will be different too: like aaa_ids_bbb1_0 (for 0st element of the ids array),...,aaa_ids_bbb1_999 (for 999th element, if there happen to be 1000 elements). So tables in each chunk can be different.
In python pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D(2nd), and D,E,F(3rd), the resulting dataframe will have columns A,B,C,D,E,F. I need to do the same thing but in Hive SQL, if possible of course.
sql amazon-web-services join hive union
Please provide some short example of JSON and explain why did you create tables with different structure
– leftjoin
Nov 17 '18 at 8:06
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.
– lugger1
Nov 19 '18 at 16:13
add a comment |
I'm using AWS cluster to process many JSON files with nested data into Hive flattened tables.
Some data sets are very big, so I need to split it into chunks and make separate small Hive table for each chunk, which leads this smaller tables to have different columns.
For example, table1
has columns A1....A1000
, table2
has columns A1,A3,A100,...A1000
, and A1001
(so some columns are common, some are unique for each table).
Is there a way to append, union or join those small tables (could be hundreds of tables, with ~1e7 rows each one) into one big Hive table, which has all the columns from small tables (and NULLs in missing cells for each small table that didn't have that specific column originally)?
Like Total_table
with columns A1...A1001
?
UPD: printSchema() command would show something like this:
root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)
When I flatten this to get a regular Hive table, depending on the set of json files that happened to be in each chunk, the number of elements in ids array (for example) can be different, and so output columns will be different too: like aaa_ids_bbb1_0 (for 0st element of the ids array),...,aaa_ids_bbb1_999 (for 999th element, if there happen to be 1000 elements). So tables in each chunk can be different.
In python pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D(2nd), and D,E,F(3rd), the resulting dataframe will have columns A,B,C,D,E,F. I need to do the same thing but in Hive SQL, if possible of course.
sql amazon-web-services join hive union
I'm using AWS cluster to process many JSON files with nested data into Hive flattened tables.
Some data sets are very big, so I need to split it into chunks and make separate small Hive table for each chunk, which leads this smaller tables to have different columns.
For example, table1
has columns A1....A1000
, table2
has columns A1,A3,A100,...A1000
, and A1001
(so some columns are common, some are unique for each table).
Is there a way to append, union or join those small tables (could be hundreds of tables, with ~1e7 rows each one) into one big Hive table, which has all the columns from small tables (and NULLs in missing cells for each small table that didn't have that specific column originally)?
Like Total_table
with columns A1...A1001
?
UPD: printSchema() command would show something like this:
root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)
When I flatten this to get a regular Hive table, depending on the set of json files that happened to be in each chunk, the number of elements in ids array (for example) can be different, and so output columns will be different too: like aaa_ids_bbb1_0 (for 0st element of the ids array),...,aaa_ids_bbb1_999 (for 999th element, if there happen to be 1000 elements). So tables in each chunk can be different.
In python pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D(2nd), and D,E,F(3rd), the resulting dataframe will have columns A,B,C,D,E,F. I need to do the same thing but in Hive SQL, if possible of course.
sql amazon-web-services join hive union
sql amazon-web-services join hive union
edited Nov 19 '18 at 16:44
lugger1
asked Nov 16 '18 at 20:37
lugger1lugger1
63921219
63921219
Please provide some short example of JSON and explain why did you create tables with different structure
– leftjoin
Nov 17 '18 at 8:06
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.
– lugger1
Nov 19 '18 at 16:13
add a comment |
Please provide some short example of JSON and explain why did you create tables with different structure
– leftjoin
Nov 17 '18 at 8:06
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.
– lugger1
Nov 19 '18 at 16:13
Please provide some short example of JSON and explain why did you create tables with different structure
– leftjoin
Nov 17 '18 at 8:06
Please provide some short example of JSON and explain why did you create tables with different structure
– leftjoin
Nov 17 '18 at 8:06
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.
– lugger1
Nov 19 '18 at 16:13
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.
– lugger1
Nov 19 '18 at 16:13
add a comment |
1 Answer
1
active
oldest
votes
1) If your tables are JSONSerDe based then you can re-create tables with the same structure, absent elements will be null.
Simple UNION ALL
:
select col1, from table
UNION ALL
select * from table 2
...
will work fine in this case. Much better put all files in the same table location, it will work fine without UNION.
2) If not possible to apply first (preferable) option, then you need to list all columns in each table, providing nulls for absent columns:
select A1, ... A1000, null A1001, null A1002 from table
UNION ALL
select A1, ... A1000, A1001, null A1002 from table 2
UNION ALL
...
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53345039%2fjoin-tables-with-different-columns-using-hive-sql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
1) If your tables are JSONSerDe based then you can re-create tables with the same structure, absent elements will be null.
Simple UNION ALL
:
select col1, from table
UNION ALL
select * from table 2
...
will work fine in this case. Much better put all files in the same table location, it will work fine without UNION.
2) If not possible to apply first (preferable) option, then you need to list all columns in each table, providing nulls for absent columns:
select A1, ... A1000, null A1001, null A1002 from table
UNION ALL
select A1, ... A1000, A1001, null A1002 from table 2
UNION ALL
...
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
add a comment |
1) If your tables are JSONSerDe based then you can re-create tables with the same structure, absent elements will be null.
Simple UNION ALL
:
select col1, from table
UNION ALL
select * from table 2
...
will work fine in this case. Much better put all files in the same table location, it will work fine without UNION.
2) If not possible to apply first (preferable) option, then you need to list all columns in each table, providing nulls for absent columns:
select A1, ... A1000, null A1001, null A1002 from table
UNION ALL
select A1, ... A1000, A1001, null A1002 from table 2
UNION ALL
...
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
add a comment |
1) If your tables are JSONSerDe based then you can re-create tables with the same structure, absent elements will be null.
Simple UNION ALL
:
select col1, from table
UNION ALL
select * from table 2
...
will work fine in this case. Much better put all files in the same table location, it will work fine without UNION.
2) If not possible to apply first (preferable) option, then you need to list all columns in each table, providing nulls for absent columns:
select A1, ... A1000, null A1001, null A1002 from table
UNION ALL
select A1, ... A1000, A1001, null A1002 from table 2
UNION ALL
...
1) If your tables are JSONSerDe based then you can re-create tables with the same structure, absent elements will be null.
Simple UNION ALL
:
select col1, from table
UNION ALL
select * from table 2
...
will work fine in this case. Much better put all files in the same table location, it will work fine without UNION.
2) If not possible to apply first (preferable) option, then you need to list all columns in each table, providing nulls for absent columns:
select A1, ... A1000, null A1001, null A1002 from table
UNION ALL
select A1, ... A1000, A1001, null A1002 from table 2
UNION ALL
...
edited Nov 19 '18 at 17:27
answered Nov 19 '18 at 17:21
leftjoinleftjoin
10.4k22356
10.4k22356
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
add a comment |
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available json files, not only columns from files included in each specific chunk? In my case those chunk tables were created by flattening (with help of pandas module) only part of all json files, namely files included to this specific chunk. I didn't use JSONSerDe, but converting pandas df into Hive table.
– lugger1
Nov 19 '18 at 21:23
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53345039%2fjoin-tables-with-different-columns-using-hive-sql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Please provide some short example of JSON and explain why did you create tables with different structure
– leftjoin
Nov 17 '18 at 8:06
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.
– lugger1
Nov 19 '18 at 16:13