cleaning raw text files with vertical bars and converting them into csv

up vote
0
down vote

favorite

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

      |group call| pvt call |phone call|group busy| pvt busy |phone busy|     

 time |total     |total     |total     |total     |total     |total     |  %  

period| sec  cnt | sec   cnt| sec   cnt| sec   cnt| sec   cnt| sec   cnt | usage



00:00 | 4323  548|    0    0|    0    0|    0    0|    0    0|    0    0| 18%



00:15 | 4125  479|    0    0|    0    0|    0    0|    0    0|    0    0| 17%



00:30 | 3071  395|    0    0|    0    0|    0    0|    0    0|    0    0| 13%



00:45 | 3514  447|    0    0|    0    0|    0    0|    0    0|    0    0| 14%



01:00 | 3081  383|    0    0|    0    0|    0    0|    0    0|    0    0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage

00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%

00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%

00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%

00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%

01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%

01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%

01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

asked Nov 11 at 4:19

Hilde Schneider

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

up vote
0
down vote

favorite

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

      |group call| pvt call |phone call|group busy| pvt busy |phone busy|     

 time |total     |total     |total     |total     |total     |total     |  %  

period| sec  cnt | sec   cnt| sec   cnt| sec   cnt| sec   cnt| sec   cnt | usage



00:00 | 4323  548|    0    0|    0    0|    0    0|    0    0|    0    0| 18%



00:15 | 4125  479|    0    0|    0    0|    0    0|    0    0|    0    0| 17%



00:30 | 3071  395|    0    0|    0    0|    0    0|    0    0|    0    0| 13%



00:45 | 3514  447|    0    0|    0    0|    0    0|    0    0|    0    0| 14%



01:00 | 3081  383|    0    0|    0    0|    0    0|    0    0|    0    0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage

00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%

00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%

00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%

00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%

01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%

01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%

01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

asked Nov 11 at 4:19

Hilde Schneider

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

up vote
0
down vote

favorite

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

      |group call| pvt call |phone call|group busy| pvt busy |phone busy|     

 time |total     |total     |total     |total     |total     |total     |  %  

period| sec  cnt | sec   cnt| sec   cnt| sec   cnt| sec   cnt| sec   cnt | usage



00:00 | 4323  548|    0    0|    0    0|    0    0|    0    0|    0    0| 18%



00:15 | 4125  479|    0    0|    0    0|    0    0|    0    0|    0    0| 17%



00:30 | 3071  395|    0    0|    0    0|    0    0|    0    0|    0    0| 13%



00:45 | 3514  447|    0    0|    0    0|    0    0|    0    0|    0    0| 14%



01:00 | 3081  383|    0    0|    0    0|    0    0|    0    0|    0    0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage

00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%

00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%

00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%

00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%

01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%

01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%

01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

asked Nov 11 at 4:19

Hilde Schneider

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

      |group call| pvt call |phone call|group busy| pvt busy |phone busy|     

 time |total     |total     |total     |total     |total     |total     |  %  

period| sec  cnt | sec   cnt| sec   cnt| sec   cnt| sec   cnt| sec   cnt | usage



00:00 | 4323  548|    0    0|    0    0|    0    0|    0    0|    0    0| 18%



00:15 | 4125  479|    0    0|    0    0|    0    0|    0    0|    0    0| 17%



00:30 | 3071  395|    0    0|    0    0|    0    0|    0    0|    0    0| 13%



00:45 | 3514  447|    0    0|    0    0|    0    0|    0    0|    0    0| 14%



01:00 | 3081  383|    0    0|    0    0|    0    0|    0    0|    0    0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage

00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%

00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%

00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%

00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%

01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%

01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%

01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

python r pandas csv

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

asked Nov 11 at 4:19

Hilde Schneider

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

asked Nov 11 at 4:19

Hilde Schneider

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

edited Nov 11 at 4:22

Eray Balkanli

3,82041943

asked Nov 11 at 4:19

Hilde Schneider

asked Nov 11 at 4:19

Hilde Schneider

asked Nov 11 at 4:19

Hilde Schneider

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)

#      0     1    2   3   4   5   6   7   8   9   10  11  12   13

#0  00:00  4323  548   0   0   0   0   0   0   0   0   0   0  18%

#1  00:15  4125  479   0   0   0   0   0   0   0   0   0   0  17%

#2  00:30  3071  395   0   0   0   0   0   0   0   0   0   0  13%

#3  00:45  3514  447   0   0   0   0   0   0   0   0   0   0  14%

#4  01:00  3081  383   0   0   0   0   0   0   0   0   0   0  13%

data.columns = "time_pd","group_call_t_s","group_call_t_c",...

data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.2k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53245801%2fcleaning-raw-text-files-with-vertical-bars-and-converting-them-into-csv%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)

#      0     1    2   3   4   5   6   7   8   9   10  11  12   13

#0  00:00  4323  548   0   0   0   0   0   0   0   0   0   0  18%

#1  00:15  4125  479   0   0   0   0   0   0   0   0   0   0  17%

#2  00:30  3071  395   0   0   0   0   0   0   0   0   0   0  13%

#3  00:45  3514  447   0   0   0   0   0   0   0   0   0   0  14%

#4  01:00  3081  383   0   0   0   0   0   0   0   0   0   0  13%

data.columns = "time_pd","group_call_t_s","group_call_t_c",...

data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.2k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)

#      0     1    2   3   4   5   6   7   8   9   10  11  12   13

#0  00:00  4323  548   0   0   0   0   0   0   0   0   0   0  18%

#1  00:15  4125  479   0   0   0   0   0   0   0   0   0   0  17%

#2  00:30  3071  395   0   0   0   0   0   0   0   0   0   0  13%

#3  00:45  3514  447   0   0   0   0   0   0   0   0   0   0  14%

#4  01:00  3081  383   0   0   0   0   0   0   0   0   0   0  13%

data.columns = "time_pd","group_call_t_s","group_call_t_c",...

data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.2k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)

#      0     1    2   3   4   5   6   7   8   9   10  11  12   13

#0  00:00  4323  548   0   0   0   0   0   0   0   0   0   0  18%

#1  00:15  4125  479   0   0   0   0   0   0   0   0   0   0  17%

#2  00:30  3071  395   0   0   0   0   0   0   0   0   0   0  13%

#3  00:45  3514  447   0   0   0   0   0   0   0   0   0   0  14%

#4  01:00  3081  383   0   0   0   0   0   0   0   0   0   0  13%

data.columns = "time_pd","group_call_t_s","group_call_t_c",...

data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.2k61948

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)

#      0     1    2   3   4   5   6   7   8   9   10  11  12   13

#0  00:00  4323  548   0   0   0   0   0   0   0   0   0   0  18%

#1  00:15  4125  479   0   0   0   0   0   0   0   0   0   0  17%

#2  00:30  3071  395   0   0   0   0   0   0   0   0   0   0  13%

#3  00:45  3514  447   0   0   0   0   0   0   0   0   0   0  14%

#4  01:00  3081  383   0   0   0   0   0   0   0   0   0   0  13%

data.columns = "time_pd","group_call_t_s","group_call_t_c",...

data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.2k61948

answered Nov 11 at 4:38

DYZ

24.2k61948

answered Nov 11 at 4:38

DYZ

24.2k61948

answered Nov 11 at 4:38

DYZ

24.2k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

DUb5rxwl9mZv8XT,io6cvSV,cXhTvHHf3qWrhUg 3fJXP,y0flEJtI

搜尋此網誌

Vfrdtyky