I have two cell arrays obtained from a database fetch:
data (R x C) - contains data with columns of doubles and strings
data_fields (C x 1) - contains column names
I want to create a dataset with C variables (each with R values), that are named using the column names in data_fields, so that I can do further processing.
Any suggestions on how to proceed?
I tried an approach involving xlswrite'ing the data to an Excel-formatted file, and using the dataset constructor to read it back in, but the Excel-formatted file is missing column names. If I then want to fix the variable names in the dataset, I basically have to do a replacedata into a new dataset that I've created with blank data, or am I overlooking a ridiculously easy way to rename variables in a dataset?
My data has 86 columns also makes a low-level I/O solution a bit impractical, as fprintf requires format strings for every field, and my SQL query may change as I continue my program.
I would prefer an approach that doesn't involve file I/O.
Using the dataset constructor and hand-specifying the individual columns in the cell array and matching the appropriate variable names is time-consuming, due to there being so many columns, plus it's not the most elegant solution. Would prefer a solution that works no matter how many columns were returned from the original SQL query.
|
|
0
|
|
|
|
Reply
|
Bob
|
11/20/2009 12:10:21 AM |
|
Further reading of the documentation showed how I can use the xlswrite and dataset constructor (with the 'XLSFile' switch) method, in conjunction with
myDataset = set(ds, 'VarNames', columnNames');
to create the dataset I wanted.
However, I'm still curious whether there is a file I/O-free way of taking a cell array and turning it into a dataset.
|
|
0
|
|
|
|
Reply
|
Bob
|
11/20/2009 9:56:19 PM
|
|
"Bob " <bobychan@hotmail.com> wrote in message <he4mpd$sqg$1@fred.mathworks.com>...
> I have two cell arrays obtained from a database fetch:
> data (R x C) - contains data with columns of doubles and strings
> data_fields (C x 1) - contains column names
for i=1:length(data_fields)
S.(data_fields{i})=[data{:,i}];'
end
|
|
0
|
|
|
|
Reply
|
Matt
|
11/20/2009 11:14:06 PM
|
|
"Matt " <xys@whatever.com> wrote in message <he77ru$7te$1@fred.mathworks.com>...
> "Bob " <bobychan@hotmail.com> wrote in message <he4mpd$sqg$1@fred.mathworks.com>...
> > I have two cell arrays obtained from a database fetch:
> > data (R x C) - contains data with columns of doubles and strings
> > data_fields (C x 1) - contains column names
>
> for i=1:length(data_fields)
>
> S.(data_fields{i})=[data{:,i}];'
>
> end
How do you initialize S so that Matlab knows it's a dataset.
If I run the code as is - i.e. starting with just data and data_fields, Matlab creates S but it's a 1x1 struct (with a bunch of data stuffed inside).
If I add:
S = dataset();
Matlab throws a bunch of errors and I get a 42575x86 dataset instead of a 655x86 dataset like I'm supposed to.
|
|
0
|
|
|
|
Reply
|
Bob
|
11/20/2009 11:52:03 PM
|
|
"Bob " <bobychan@hotmail.com> wrote in message <he7a33$ncn$1@fred.mathworks.com>...
> How do you initialize S so that Matlab knows it's a dataset.
> If I run the code as is - i.e. starting with just data and data_fields, Matlab creates S but it's a 1x1 struct (with a bunch of data stuffed inside).
> If I add:
>
> S = dataset();
>
> Matlab throws a bunch of errors and I get a 42575x86 dataset instead of a 655x86 dataset like I'm supposed to.
You need to clarify what you mean by a "dataset", in particular what kind of MATLAB variable type (cell, struct, numeric,etc...) that you want to use to represent it.
As it stands, your data contains both doubles and strings. There is no MATLAB variable type apart from a cell array that lets you hold both strings and numeric type in a common rectangular array. There is also no variable type apart from struct that allows you to assign different names to different parts of the data.
|
|
0
|
|
|
|
Reply
|
Matt
|
11/21/2009 8:16:27 PM
|
|
Dear Matt!
> You need to clarify what you mean by a "dataset", in particular what kind of MATLAB variable type (cell, struct, numeric,etc...) that you want to use to represent it.
DATASET is part of the statistics toolbox.
Kind regards, Jan
|
|
0
|
|
|
|
Reply
|
Jan
|
11/22/2009 12:09:03 PM
|
|
"Bob " <bobychan@hotmail.com> wrote in message <he73a3$q70$1@fred.mathworks.com>...
> Further reading of the documentation showed how I can use the xlswrite and dataset constructor (with the 'XLSFile' switch) method, in conjunction with
>
> myDataset = set(ds, 'VarNames', columnNames');
>
> to create the dataset I wanted.
>
> However, I'm still curious whether there is a file I/O-free way of taking a cell array and turning it into a dataset.
As far as I know, no! So I also use the I/O-way. However, I'm using the "export" method for datasets and txt files. For my data, that's faster and produces smaller files. Perhaps this helps.
>> data
data =
[0.6177] [0.2399] 'string' [0.6820]
[0.8594] [0.8865] 'string' [0.0424]
[0.8055] [0.0287] 'string' [0.0714]
[0.5767] [0.4899] 'string' [0.5216]
[0.1829] [0.1679] 'string' [0.0967]
>> s=dataset(data)
s =
data
[1x4 cell]
[1x4 cell]
[1x4 cell]
[1x4 cell]
[1x4 cell]
>> export(s)
>> s=dataset('file', 's.txt','VarNames', data_fields)
s =
col1 col2 col3 col4
0.61767 0.23993 'string' 0.68197
0.85944 0.88651 'string' 0.042431
0.80549 0.028674 'string' 0.071445
0.57672 0.4899 'string' 0.52165
0.18292 0.16793 'string' 0.09673
>> delete('s.txt')
|
|
0
|
|
|
|
Reply
|
Anna
|
11/22/2009 1:40:28 PM
|
|
Bob wrote:
> I have two cell arrays obtained from a database fetch:
> data (R x C) - contains data with columns of doubles and strings
> data_fields (C x 1) - contains column names
>
> I want to create a dataset with C variables (each with R values), that are named using the column names in data_fields, so that I can do further processing.
Bob, using file I/O is probably overthinking this. Here's what I'd do:
>> c = {1 4 'a' 7; 2 5 'b' 8; 3 6 'c' 9}
c =
[1] [4] 'a' [7]
[2] [5] 'b' [8]
[3] [6] 'c' [9]
>> names = {'w' 'x' 'y' 'z'};
There are two problems with these data, relative to the form you ultimately want them in:
1) they're all in a single cell array, and
2) the numeric data are stored in cells, not as numeric vectors.
The first problem is easy, there's a syntax for the dataset constructor specifically for converting a single array into multiple dataset variables. It takes the array and then a list of names, one for each column in the array. Here, I've used {:} to expand out the cell array of strings that contains the variable names:
>> d = dataset({c,names{:}})
d =
w x y z
[1] [4] 'a' [7]
[2] [5] 'b' [8]
[3] [6] 'c' [9]
Now you almost certainly want to give those numeric variables the "right type" -- they're still stored as cell vectors. Here's the most direct way:
>> d.w = cell2mat(d.w);
d.x = cell2mat(d.x);
d.z = cell2mat(d.z);
d
d =
w x y z
1 4 'a' 7
2 5 'b' 8
3 6 'c' 9
To automate that last step, you can pretty easily cook up a call to the DATASETFUN method (like the CELLFUN and STRUCTFUN functions).
Now you can do things like
>> log(d.x)
ans =
1.3863
1.6094
1.7918
or even
>> d.logx = log(d.x)
d =
w x y z logx
1 4 'a' 7 1.3863
2 5 'b' 8 1.6094
3 6 'c' 9 1.7918
The above is only one way. Another might be to do the cell->numeric conversion first, storing temp vars in the workspace or in a scalar structure. That might be harder to automate.
By the way, if you really _wanted_ to use files, there are straight-forward ways to
read the variables names from a file (test or Excel), and simple ways to change the variables names in an existing dataset array. For example,
>> d.Properties.VarNames = {'W' 'X' 'Y' 'Z'}
d =
W X Y Z
1 4 'a' 7
2 5 'b' 8
3 6 'c' 9
That one is perhaps easy to miss in the documentation.
Hope this helps.
|
|
0
|
|
|
|
Reply
|
Peter.Perkins (345)
|
11/22/2009 4:22:20 PM
|
|
Matt wrote:
> As it stands, your data contains both doubles and strings. There is no MATLAB variable type apart from a cell array that lets you hold both strings and numeric type in a common rectangular array. There is also no variable type apart from struct that allows you to assign different names to different parts of the data.
Matt, strictly speaking that's true in core MATLAB, but Bob is referring to the dataset array that is part of the Statistics Toolbox. Using that, and your code
>> data = {1 4 'a' 7; 2 5 'b' 8; 3 6 'c' 9};
>> data_fields = {'w' 'x' 'y' 'z'};
>> S = dataset()
S =
[empty 0-by-0 dataset]
>> for i=1:length(data_fields), S.(data_fields{i})=[data{:,i}]'; end
>> S
S =
w x y z
1 4 a 7
2 5 b 8
3 6 c 9
Bob, I'm not sure why you would have gotten errors and unexpected results. Is the above what you did?
Note that the above code creates a dataset array in which "y" is a char column vector. In my other reply to this thread, "y" ended up as a cell array of strings. Either is possible, it's just that the above code ends up this way:
>> summary(S)
w: [3x1 double]
min 1st Q median 3rd Q max
1 1.25 2 2.75 3
x: [3x1 double]
min 1st Q median 3rd Q max
4 4.25 5 5.75 6
y: [3x1 char]
z: [3x1 double]
min 1st Q median 3rd Q max
7 7.25 8 8.75 9
Hope this helps.
|
|
0
|
|
|
|
Reply
|
Peter.Perkins (345)
|
11/22/2009 4:22:24 PM
|
|
Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message <hebog0$mn1$2@fred.mathworks.com>...
> Matt wrote:
>
> > As it stands, your data contains both doubles and strings. There is no MATLAB variable type apart from a cell array that lets you hold both strings and numeric type in a common rectangular array. There is also no variable type apart from struct that allows you to assign different names to different parts of the data.
>
> Matt, strictly speaking that's true in core MATLAB, but Bob is referring to the dataset array that is part of the Statistics Toolbox. Using that, and your code
>
> >> data = {1 4 'a' 7; 2 5 'b' 8; 3 6 'c' 9};
> >> data_fields = {'w' 'x' 'y' 'z'};
> >> S = dataset()
> S =
> [empty 0-by-0 dataset]
> >> for i=1:length(data_fields), S.(data_fields{i})=[data{:,i}]'; end
> >> S
> S =
> w x y z
> 1 4 a 7
> 2 5 b 8
> 3 6 c 9
>
> Bob, I'm not sure why you would have gotten errors and unexpected results. Is the above what you did?
It may have something to do with the fact that I had the semicolon preceding the transpose ' in the code I posted.
|
|
0
|
|
|
|
Reply
|
xys (1072)
|
11/22/2009 4:43:04 PM
|
|
Thank you to all of the people who have responded to my question.
Peter and Matt,
I broke the first rule of forum question-asking - not posting sample code. Thank you for correcting my error. To expand on the example you so kindly provided, I ran your code and got the same results (a 3x4 dataset), as expected. However, I used it on my own "real" data and data_fields, and got completely different results.
I have replicated the problem with the following code:
data = {1 4 'ab' 7; 2 5 'cd' 8; 3 6 'ef' 9};
data_fields = {'w' 'x' 'y' 'z'};
S = dataset();
for i=1:length(data_fields)
S.(data_fields{i})=[data{:,i}]';
end
This should give a dataset with the same dimensions - 3 rows, 4 columns. However, it actually gives a 6x4 dataset due to the two-character strings being expanded to separate rows.
S =
w x y z
1 4 a 7
2 5 b 8
3 6 c 9
0 0 d 0
0 0 e 0
0 0 f 0
Needless to say, this is unexpected behavior and I am unsure whether I do not sufficiently understand cell arrays, or if this is an actual bug in the language. I'm leaning towards a bug because if this behavior was introduced by design, I would think that the integer rows would have been duplicated as the strings in that row were expanded, rather than the dataset being padded with zeros, e.g.
S =
w x y z
1 4 a 7
1 4 b 7
2 5 c 8
2 5 d 8
3 6 e 9
3 6 f 9
The approach outlined in Peter's first reply:
S = dataset({data, data_fields{:}});
does work better, but requires an additional for loop to check each dataset variable and change numeric cell vectors to the "right type" with cell2mat while leaving the string variables untouched.
|
|
0
|
|
|
|
Reply
|
Bob
|
11/23/2009 7:48:03 PM
|
|
Bob wrote:
> I have replicated the problem with the following code:
>
> data = {1 4 'ab' 7; 2 5 'cd' 8; 3 6 'ef' 9};
> data_fields = {'w' 'x' 'y' 'z'};
> S = dataset();
> for i=1:length(data_fields)
> S.(data_fields{i})=[data{:,i}]';
> end
>
> This should give a dataset with the same dimensions - 3 rows, 4 columns. However, it actually gives a 6x4 dataset due to the two-character strings being expanded to separate rows.
>
> S =
> w x y z
> 1 4 a 7
> 2 5 b 8
> 3 6 c 9
> 0 0 d 0
> 0 0 e 0
> 0 0 f 0
>
> Needless to say, this is unexpected behavior and I am unsure whether I do not sufficiently understand cell arrays, or if this is an actual bug in the language. I'm leaning towards a bug because if this behavior was introduced by design, I would think that the integer rows would have been duplicated as the strings in that row were expanded, rather than the dataset being padded with zeros, e.g.
Not a bug, and only partially having to do with the finer points of cell arrays. It's this line here:
S.(data_fields{i})=[data{:,i}]';
that blindly takes {'ab'; 'cd'; 'ef'}, and turns it into ['abcdef']', a 6x1 column vector. When you assign that as a new variable into the existing dataset array, you implicitly add rows with default values for the other variables (i.e., zeros). This is essentially the same thing that would happen if you assigned to the (6,4)th element of a 4x2 double array, by the way.
So the code I wrote was simple, but not flexible enough for your needs. You'll have to improve on it. Put in an if test, and no worries.
|
|
0
|
|
|
|
Reply
|
Peter
|
11/23/2009 8:58:26 PM
|
|
Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message <heet1i$5u8$1@fred.mathworks.com>...
> Bob wrote:
>
> > I have replicated the problem with the following code:
> >
> > data = {1 4 'ab' 7; 2 5 'cd' 8; 3 6 'ef' 9};
> > data_fields = {'w' 'x' 'y' 'z'};
> > S = dataset();
> > for i=1:length(data_fields)
> > S.(data_fields{i})=[data{:,i}]';
> > end
> >
> > This should give a dataset with the same dimensions - 3 rows, 4 columns. However, it actually gives a 6x4 dataset due to the two-character strings being expanded to separate rows.
> >
> > S =
> > w x y z
> > 1 4 a 7
> > 2 5 b 8
> > 3 6 c 9
> > 0 0 d 0
> > 0 0 e 0
> > 0 0 f 0
> >
> > Needless to say, this is unexpected behavior and I am unsure whether I do not sufficiently understand cell arrays, or if this is an actual bug in the language. I'm leaning towards a bug because if this behavior was introduced by design, I would think that the integer rows would have been duplicated as the strings in that row were expanded, rather than the dataset being padded with zeros, e.g.
>
> Not a bug, and only partially having to do with the finer points of cell arrays. It's this line here:
>
> S.(data_fields{i})=[data{:,i}]';
>
> that blindly takes {'ab'; 'cd'; 'ef'}, and turns it into ['abcdef']', a 6x1 column vector. When you assign that as a new variable into the existing dataset array, you implicitly add rows with default values for the other variables (i.e., zeros). This is essentially the same thing that would happen if you assigned to the (6,4)th element of a 4x2 double array, by the way.
>
> So the code I wrote was simple, but not flexible enough for your needs. You'll have to improve on it. Put in an if test, and no worries.
On the subject of arrays, how would this work if there were more than two cell arrays?
|
|
0
|
|
|
|
Reply
|
CyberFrog
|
4/19/2010 3:07:05 AM
|
|
|
12 Replies
858 Views
(page loaded in 0.231 seconds)
|