Join tables with different columns using Hive SQL



























I'm using an AWS cluster to process many JSON files with nested data into flattened Hive tables.

Some data sets are very large, so I need to split them into chunks and create a separate small Hive table for each chunk, which means these smaller tables end up with different columns.
For example, table1 has columns A1...A1000, while table2 has columns A1, A3, A100, ..., A1000, and A1001 (so some columns are common and some are unique to each table).

Is there a way to append, union, or join these small tables (there could be hundreds of them, with ~1e7 rows each) into one big Hive table that has all the columns from the small tables, with NULLs in the cells for columns a given small table didn't originally have?
Something like Total_table with columns A1...A1001?



UPD: The printSchema() command shows something like this:



root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)


When I flatten this into a regular Hive table, the number of elements in the ids array (for example) can differ depending on which JSON files happened to land in each chunk, so the output columns differ too: e.g. aaa_ids_bb1_0 (for the 0th element of the ids array), ..., aaa_ids_bb1_999 (for the 999th element, if there happen to be 1000 elements). So the tables for each chunk can be different.
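To make the column drift concrete, here is a minimal sketch (hypothetical records and a hypothetical flatten helper, not the actual pipeline code) of how a variable-length ids array produces different flattened columns per chunk:

```python
import pandas as pd

# Hypothetical records matching the printSchema() output above:
# a struct "aaa" holding a variable-length array "ids" of structs.
rec1 = {"aaa": {"ids": [{"bb1": "x", "bb2": "y", "bb3": "z"}]}}
rec2 = {"aaa": {"ids": [{"bb1": "a"}, {"bb1": "b"}]}}  # 2 elements -> an extra column

def flatten(rec):
    # One output column per array element: aaa_ids_bb1_0, aaa_ids_bb1_1, ...
    out = {}
    for i, elem in enumerate(rec["aaa"]["ids"]):
        for key, value in elem.items():
            out[f"aaa_ids_{key}_{i}"] = value
    return out

# The frame takes the union of columns; records lacking a column get NaN
df = pd.DataFrame([flatten(rec1), flatten(rec2)])
print(sorted(df.columns))
# ['aaa_ids_bb1_0', 'aaa_ids_bb1_1', 'aaa_ids_bb2_0', 'aaa_ids_bb3_0']
```

A chunk containing only rec1-style files would produce a table without the aaa_ids_bb1_1 column at all.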



In Python's pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D (2nd), and D,E,F (3rd), the resulting dataframe has columns A,B,C,D,E,F. I need to do the same thing in Hive SQL, if that is possible.
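For reference, the pandas behavior described above can be reproduced with pd.concat (a minimal sketch with toy one-row dataframes):

```python
import pandas as pd

# Toy stand-ins for the three dataframes described above
df1 = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
df2 = pd.DataFrame({"A": [4], "C": [5], "D": [6]})
df3 = pd.DataFrame({"D": [7], "E": [8], "F": [9]})

# concat takes the union of columns; cells missing from a source
# dataframe are filled with NaN (the analogue of Hive's NULL)
total = pd.concat([df1, df2, df3], ignore_index=True, sort=False)
print(sorted(total.columns))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

The question is how to get this same union-of-columns, NULL-filled result in Hive SQL.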










  • Please provide a short example of the JSON and explain why you created tables with different structures

    – leftjoin
    Nov 17 '18 at 8:06











  • I can't provide an example JSON file (those files are big), but I updated my question explaining how chunks can end up with different structures.

    – lugger1
    Nov 19 '18 at 16:13


















sql amazon-web-services join hive union






edited Nov 19 '18 at 16:44







lugger1

















asked Nov 16 '18 at 20:37









lugger1

1 Answer
1) If your tables are JSONSerDe-based, you can re-create them all with the same structure; absent elements will simply be NULL.

A simple UNION ALL:

select col1, col2, ... from table1
UNION ALL
select col1, col2, ... from table2
...

will work fine in this case. Even better, put all the files in the same table location; then it will work without any UNION at all.

2) If the first (preferable) option is not possible, then you need to list all columns in each SELECT, providing NULLs for the absent columns:

select A1, ... A1000, null A1001, null A1002 from table1
UNION ALL
select A1, ... A1000, A1001, null A1002 from table2
UNION ALL
...
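With hundreds of chunk tables, writing the option-2 column lists by hand is impractical. A sketch (hypothetical table names and column lists; in practice they could be collected from DESCRIBE output) of generating the UNION ALL statement programmatically:

```python
def union_all_sql(tables):
    """tables: dict mapping table name -> list of its column names."""
    # Union of all columns across every chunk table, in a stable order
    all_cols = sorted({col for cols in tables.values() for col in cols})
    selects = []
    for name, cols in tables.items():
        have = set(cols)
        # Select the column where it exists, otherwise NULL under that name.
        # (Some Hive versions may need an explicit cast, e.g. cast(null as string).)
        exprs = [col if col in have else f"NULL AS {col}" for col in all_cols]
        selects.append(f"SELECT {', '.join(exprs)} FROM {name}")
    return "\nUNION ALL\n".join(selects)

# Hypothetical two-table example
print(union_all_sql({"table1": ["A1", "A2"], "table2": ["A1", "A3"]}))
# SELECT A1, A2, NULL AS A3 FROM table1
# UNION ALL
# SELECT A1, NULL AS A2, A3 FROM table2
```

The generated statement can then be used in a CREATE TABLE ... AS SELECT to build the combined table.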





  • Thank you @leftjoin. But what do you mean by 'JSONSerDe based'? Does it require all chunk tables to have all possible columns from all available JSON files, not only the columns from the files included in each specific chunk? In my case, the chunk tables were created by flattening (with the help of the pandas module) only part of the JSON files, namely the files included in that specific chunk. I didn't use JSONSerDe; I converted a pandas dataframe into a Hive table.

    – lugger1
    Nov 19 '18 at 21:23












edited Nov 19 '18 at 17:27

























answered Nov 19 '18 at 17:21









leftjoin
