Join tables with different columns using Hive SQL



























I'm using an AWS cluster to process many JSON files with nested data into flattened Hive tables.

Some data sets are very large, so I need to split them into chunks and create a separate small Hive table for each chunk, which means these smaller tables end up with different columns.
For example, table1 has columns A1...A1000, while table2 has columns A1, A3, A100, ..., A1000, and A1001 (so some columns are common and some are unique to each table).

Is there a way to append, union, or join these small tables (there could be hundreds of them, with ~1e7 rows each) into one big Hive table that has all the columns from the small tables, with NULLs in the cells for columns a given small table didn't originally have?
Something like Total_table with columns A1...A1001?



UPD: The printSchema() command shows something like this:



root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)


When I flatten this into a regular Hive table, the number of elements in the ids array (for example) can differ depending on which JSON files happened to land in each chunk, so the output columns differ too: e.g. aaa_ids_bb1_0 (for the 0th element of the ids array), ..., aaa_ids_bb1_999 (for the 999th element, if there happen to be 1000 elements). So the tables for each chunk can be different.
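To make the column drift concrete, here is a minimal sketch (hypothetical records and a hypothetical flatten helper, not the actual pipeline code) of how a variable-length ids array produces different flattened columns per chunk:

```python
import pandas as pd

# Hypothetical records matching the printSchema() output above:
# a struct "aaa" holding a variable-length array "ids" of structs.
rec1 = {"aaa": {"ids": [{"bb1": "x", "bb2": "y", "bb3": "z"}]}}
rec2 = {"aaa": {"ids": [{"bb1": "a"}, {"bb1": "b"}]}}  # 2 elements -> an extra column

def flatten(rec):
    # One output column per array element: aaa_ids_bb1_0, aaa_ids_bb1_1, ...
    out = {}
    for i, elem in enumerate(rec["aaa"]["ids"]):
        for key, value in elem.items():
            out[f"aaa_ids_{key}_{i}"] = value
    return out

# The frame takes the union of columns; records lacking a column get NaN
df = pd.DataFrame([flatten(rec1), flatten(rec2)])
print(sorted(df.columns))
# ['aaa_ids_bb1_0', 'aaa_ids_bb1_1', 'aaa_ids_bb2_0', 'aaa_ids_bb3_0']
```

A chunk containing only rec1-style files would produce a table without the aaa_ids_bb1_1 column at all.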



In Python's pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D (2nd), and D,E,F (3rd), the resulting dataframe has columns A,B,C,D,E,F. I need to do the same thing in Hive SQL, if that is possible.
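For reference, the pandas behavior described above can be reproduced with pd.concat (a minimal sketch with toy one-row dataframes):

```python
import pandas as pd

# Toy stand-ins for the three dataframes described above
df1 = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
df2 = pd.DataFrame({"A": [4], "C": [5], "D": [6]})
df3 = pd.DataFrame({"D": [7], "E": [8], "F": [9]})

# concat takes the union of columns; cells missing from a source
# dataframe are filled with NaN (the analogue of Hive's NULL)
total = pd.concat([df1, df2, df3], ignore_index=True, sort=False)
print(sorted(total.columns))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

The question is how to get this same union-of-columns, NULL-filled result in Hive SQL.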










  • Please provide a short example of the JSON and explain why you created tables with different structures

    – leftjoin
    Nov 17 '18 at 8:06











  • I can't provide an example JSON file (those files are big), but I updated my question explaining how chunks can end up with different structures.

    – lugger1
    Nov 19 '18 at 16:13


















sql amazon-web-services join hive union






edited Nov 19 '18 at 16:44







lugger1

















asked Nov 16 '18 at 20:37









lugger1

1 Answer
1) If your tables are JSONSerDe-based, you can re-create them all with the same structure; absent elements will simply be NULL.

A simple UNION ALL:

select col1, col2, ... from table1
UNION ALL
select col1, col2, ... from table2
...

will work fine in this case. Even better, put all the files in the same table location; then it will work without any UNION at all.

2) If the first (preferable) option is not possible, then you need to list all columns in each SELECT, providing NULLs for the absent columns:

select A1, ... A1000, null A1001, null A1002 from table1
UNION ALL
select A1, ... A1000, A1001, null A1002 from table2
UNION ALL
...
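With hundreds of chunk tables, writing the option-2 column lists by hand is impractical. A sketch (hypothetical table names and column lists; in practice they could be collected from DESCRIBE output) of generating the UNION ALL statement programmatically:

```python
def union_all_sql(tables):
    """tables: dict mapping table name -> list of its column names."""
    # Union of all columns across every chunk table, in a stable order
    all_cols = sorted({col for cols in tables.values() for col in cols})
    selects = []
    for name, cols in tables.items():
        have = set(cols)
        # Select the column where it exists, otherwise NULL under that name.
        # (Some Hive versions may need an explicit cast, e.g. cast(null as string).)
        exprs = [col if col in have else f"NULL AS {col}" for col in all_cols]
        selects.append(f"SELECT {', '.join(exprs)} FROM {name}")
    return "\nUNION ALL\n".join(selects)

# Hypothetical two-table example
print(union_all_sql({"table1": ["A1", "A2"], "table2": ["A1", "A3"]}))
# SELECT A1, A2, NULL AS A3 FROM table1
# UNION ALL
# SELECT A1, NULL AS A2, A3 FROM table2
```

The generated statement can then be used in a CREATE TABLE ... AS SELECT to build the combined table.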





  • Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available JSON files, not only the columns from the files included in each specific chunk? In my case, the chunk tables were created by flattening (with the help of the pandas module) only part of the JSON files, namely the files included in that specific chunk. I didn't use JSONSerDe; I converted a pandas dataframe into a Hive table.

    – lugger1
    Nov 19 '18 at 21:23












edited Nov 19 '18 at 17:27

























answered Nov 19 '18 at 17:21









leftjoin
