cleaning raw text files with vertical bars and converting them into csv
up vote
0
down vote
favorite
I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.
|group call| pvt call |phone call|group busy| pvt busy |phone busy|
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage
00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%
00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%
00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%
01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:
time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%
python r pandas csv
add a comment |
up vote
0
down vote
favorite
I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.
|group call| pvt call |phone call|group busy| pvt busy |phone busy|
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage
00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%
00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%
00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%
01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:
time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%
python r pandas csv
If you can guarantee that there will not be embedded spaces in any of the fields, then trygsub('[ |]+', ',', readLines(filename))
for basicsep=
conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.
|group call| pvt call |phone call|group busy| pvt busy |phone busy|
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage
00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%
00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%
00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%
01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:
time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%
python r pandas csv
I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.
|group call| pvt call |phone call|group busy| pvt busy |phone busy|
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage
00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%
00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%
00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%
01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%
I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:
time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%
python r pandas csv
python r pandas csv
edited Nov 11 at 4:22
Eray Balkanli
3,82041943
3,82041943
asked Nov 11 at 4:19
Hilde Schneider
53
53
If you can guarantee that there will not be embedded spaces in any of the fields, then trygsub('[ |]+', ',', readLines(filename))
for basicsep=
conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21
add a comment |
If you can guarantee that there will not be embedded spaces in any of the fields, then trygsub('[ |]+', ',', readLines(filename))
for basicsep=
conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21
If you can guarantee that there will not be embedded spaces in any of the fields, then try
gsub('[ |]+', ',', readLines(filename))
for basic sep=
conversion. Your column naming is a bit more work, though.– r2evans
Nov 11 at 4:21
If you can guarantee that there will not be embedded spaces in any of the fields, then try
gsub('[ |]+', ',', readLines(filename))
for basic sep=
conversion. Your column naming is a bit more work, though.– r2evans
Nov 11 at 4:21
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
Python
If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:
data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
Python
If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:
data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
add a comment |
up vote
0
down vote
Python
If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:
data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
add a comment |
up vote
0
down vote
up vote
0
down vote
Python
If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:
data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)
Python
If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:
data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)
answered Nov 11 at 4:38
DYZ
24.2k61948
24.2k61948
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
add a comment |
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53245801%2fcleaning-raw-text-files-with-vertical-bars-and-converting-them-into-csv%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
If you can guarantee that there will not be embedded spaces in any of the fields, then try
gsub('[ |]+', ',', readLines(filename))
for basicsep=
conversion. Your column naming is a bit more work, though.– r2evans
Nov 11 at 4:21