Converting a cell array into a dataset

  • Follow


I have two cell arrays obtained from a database fetch:
data (R x C) - contains data with columns of doubles and strings
data_fields (C x 1) - contains column names

I want to create a dataset with C variables (each with R values), that are named using the column names in data_fields, so that I can do further processing.

Any suggestions on how to proceed?

I tried an approach involving xlswrite'ing the data to an Excel-formatted file, and using the dataset constructor to read it back in, but the Excel-formatted file is missing column names. If I then want to fix the variable names in the dataset, I basically have to do a replacedata into a new dataset that I've created with blank data, or am I overlooking a ridiculously easy way to rename variables in a dataset?

My data has 86 columns also makes a low-level I/O solution a bit impractical, as fprintf requires format strings for every field, and my SQL query may change as I continue my program.

I would prefer an approach that doesn't involve file I/O.

Using the dataset constructor and hand-specifying the individual columns in the cell array and matching the appropriate variable names is time-consuming, due to there being so many columns, plus it's not the most elegant solution. Would prefer a solution that works no matter how many columns were returned from the original SQL query.
0
Reply Bob 11/20/2009 12:10:21 AM

Further reading of the documentation showed how I can use the xlswrite and dataset constructor (with the 'XLSFile' switch) method, in conjunction with

myDataset = set(ds, 'VarNames', columnNames');

to create the dataset I wanted.

However, I'm still curious whether there is a file I/O-free way of taking a cell array and turning it into a dataset.
0
Reply Bob 11/20/2009 9:56:19 PM


"Bob " <bobychan@hotmail.com> wrote in message <he4mpd$sqg$1@fred.mathworks.com>...
> I have two cell arrays obtained from a database fetch:
> data (R x C) - contains data with columns of doubles and strings
> data_fields (C x 1) - contains column names

for i=1:length(data_fields)

S.(data_fields{i})=[data{:,i}];'

end
0
Reply Matt 11/20/2009 11:14:06 PM

"Matt " <xys@whatever.com> wrote in message <he77ru$7te$1@fred.mathworks.com>...
> "Bob " <bobychan@hotmail.com> wrote in message <he4mpd$sqg$1@fred.mathworks.com>...
> > I have two cell arrays obtained from a database fetch:
> > data (R x C) - contains data with columns of doubles and strings
> > data_fields (C x 1) - contains column names
> 
> for i=1:length(data_fields)
> 
> S.(data_fields{i})=[data{:,i}];'
> 
> end

How do you initialize S so that Matlab knows it's a dataset.
If I run the code as is - i.e. starting with just data and data_fields, Matlab creates S but it's a 1x1 struct (with a bunch of data stuffed inside).
If I add:

S = dataset();

Matlab throws a bunch of errors and I get a 42575x86 dataset instead of a 655x86 dataset like I'm supposed to.
0
Reply Bob 11/20/2009 11:52:03 PM

"Bob " <bobychan@hotmail.com> wrote in message <he7a33$ncn$1@fred.mathworks.com>...

> How do you initialize S so that Matlab knows it's a dataset.
> If I run the code as is - i.e. starting with just data and data_fields, Matlab creates S but it's a 1x1 struct (with a bunch of data stuffed inside).
> If I add:
> 
> S = dataset();
> 
> Matlab throws a bunch of errors and I get a 42575x86 dataset instead of a 655x86 dataset like I'm supposed to.

You need to clarify what you mean by a "dataset", in particular what kind of MATLAB variable type (cell, struct, numeric,etc...) that you want to use to represent it.

As it stands, your data contains both doubles and strings. There is no MATLAB variable type apart from a cell array that lets you hold both strings and numeric type in a common rectangular array. There is also no variable type apart from struct that allows you to assign different names to different parts of the data.
0
Reply Matt 11/21/2009 8:16:27 PM

Dear Matt!

> You need to clarify what you mean by a "dataset", in particular what kind of MATLAB variable type (cell, struct, numeric,etc...) that you want to use to represent it.

DATASET is part of the statistics toolbox.

Kind regards, Jan
0
Reply Jan 11/22/2009 12:09:03 PM

"Bob " <bobychan@hotmail.com> wrote in message <he73a3$q70$1@fred.mathworks.com>...
> Further reading of the documentation showed how I can use the xlswrite and dataset constructor (with the 'XLSFile' switch) method, in conjunction with
> 
> myDataset = set(ds, 'VarNames', columnNames');
> 
> to create the dataset I wanted.
> 
> However, I'm still curious whether there is a file I/O-free way of taking a cell array and turning it into a dataset.

As far as I know, no! So I also use the I/O-way. However, I'm using the "export" method for datasets and txt files. For my data, that's faster and produces smaller files. Perhaps this helps. 

>> data

data = 

    [0.6177]    [0.2399]    'string'    [0.6820]
    [0.8594]    [0.8865]    'string'    [0.0424]
    [0.8055]    [0.0287]    'string'    [0.0714]
    [0.5767]    [0.4899]    'string'    [0.5216]
    [0.1829]    [0.1679]    'string'    [0.0967]

>> s=dataset(data)

s = 

    data      
    [1x4 cell]
    [1x4 cell]
    [1x4 cell]
    [1x4 cell]
    [1x4 cell]

>> export(s)
>> s=dataset('file', 's.txt','VarNames', data_fields)

s = 

    col1       col2        col3            col4    
    0.61767     0.23993    'string'         0.68197
    0.85944     0.88651    'string'        0.042431
    0.80549    0.028674    'string'        0.071445
    0.57672      0.4899    'string'         0.52165
    0.18292     0.16793    'string'         0.09673

>> delete('s.txt')
0
Reply Anna 11/22/2009 1:40:28 PM

Bob wrote:
> I have two cell arrays obtained from a database fetch:
> data (R x C) - contains data with columns of doubles and strings
> data_fields (C x 1) - contains column names
> 
> I want to create a dataset with C variables (each with R values), that are named using the column names in data_fields, so that I can do further processing.

Bob, using file I/O is probably overthinking this.  Here's what I'd do:

>> c = {1 4 'a' 7; 2 5 'b' 8; 3 6 'c' 9}
c = 
    [1]    [4]    'a'    [7]
    [2]    [5]    'b'    [8]
    [3]    [6]    'c'    [9]
>> names = {'w' 'x' 'y' 'z'};

There are two problems with these data, relative to the form you ultimately want them in:

1) they're all in a single cell array, and
2) the numeric data are stored in cells, not as numeric vectors.

The first problem is easy, there's a syntax for the dataset constructor specifically for converting a single array into multiple dataset variables.  It takes the array and then a list of names, one for each column in the array.  Here, I've used {:} to expand out the cell array of strings that contains the variable names:

>> d = dataset({c,names{:}})
d = 
    w          x          y          z      
    [1]        [4]        'a'        [7]    
    [2]        [5]        'b'        [8]    
    [3]        [6]        'c'        [9]    

Now you almost certainly want to give those numeric variables the "right type" -- they're still stored as cell vectors.  Here's the most direct way:

>> d.w = cell2mat(d.w);
d.x = cell2mat(d.x);
d.z = cell2mat(d.z);
d
d = 
    w    x    y          z
    1    4    'a'        7
    2    5    'b'        8
    3    6    'c'        9

To automate that last step, you can pretty easily cook up a call to the DATASETFUN method (like the CELLFUN and STRUCTFUN functions).

Now you can do things like

>> log(d.x)
ans =
       1.3863
       1.6094
       1.7918

or even

>> d.logx = log(d.x)
d = 
    w    x    y          z    logx  
    1    4    'a'        7    1.3863
    2    5    'b'        8    1.6094
    3    6    'c'        9    1.7918


The above is only one way.  Another might be to do the cell->numeric conversion first, storing temp vars in the workspace or in a scalar structure.  That might be harder to automate.

By the way, if you really _wanted_ to use files, there are straight-forward ways to 
read the variables names from a file (test or Excel), and simple ways to change the variables names in an existing dataset array.  For example,

>> d.Properties.VarNames = {'W' 'X' 'Y' 'Z'}
d = 
    W    X    Y          Z
    1    4    'a'        7
    2    5    'b'        8
    3    6    'c'        9

That one is perhaps easy to miss in the documentation.

Hope this helps.
0
Reply Peter.Perkins (345) 11/22/2009 4:22:20 PM

Matt wrote:

> As it stands, your data contains both doubles and strings. There is no MATLAB variable type apart from a cell array that lets you hold both strings and numeric type in a common rectangular array. There is also no variable type apart from struct that allows you to assign different names to different parts of the data.

Matt, strictly speaking that's true in core MATLAB, but Bob is referring to the dataset array that is part of the Statistics Toolbox.  Using that, and your code

>> data = {1 4 'a' 7; 2 5 'b' 8; 3 6 'c' 9};
>> data_fields = {'w' 'x' 'y' 'z'};
>> S = dataset()
S = 
[empty 0-by-0 dataset]
>> for i=1:length(data_fields), S.(data_fields{i})=[data{:,i}]'; end
>> S
S = 
    w    x    y    z
    1    4    a    7
    2    5    b    8
    3    6    c    9

Bob, I'm not sure why you would have gotten errors and unexpected results.  Is the above what you did?

Note that the above code creates a dataset array in which "y" is a char column vector.  In my other reply to this thread, "y" ended up as a cell array of strings.  Either is possible, it's just that the above code ends up this way:

>> summary(S)
w: [3x1 double]
    min    1st Q       median    3rd Q       max
    1      1.25        2         2.75        3  
x: [3x1 double]
    min    1st Q       median    3rd Q       max
    4      4.25        5         5.75        6  
y: [3x1 char]
z: [3x1 double]
    min    1st Q       median    3rd Q       max
    7      7.25        8         8.75        9  

Hope this helps.
0
Reply Peter.Perkins (345) 11/22/2009 4:22:24 PM

Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message <hebog0$mn1$2@fred.mathworks.com>...
> Matt wrote:
> 
> > As it stands, your data contains both doubles and strings. There is no MATLAB variable type apart from a cell array that lets you hold both strings and numeric type in a common rectangular array. There is also no variable type apart from struct that allows you to assign different names to different parts of the data.
> 
> Matt, strictly speaking that's true in core MATLAB, but Bob is referring to the dataset array that is part of the Statistics Toolbox.  Using that, and your code
> 
> >> data = {1 4 'a' 7; 2 5 'b' 8; 3 6 'c' 9};
> >> data_fields = {'w' 'x' 'y' 'z'};
> >> S = dataset()
> S = 
> [empty 0-by-0 dataset]
> >> for i=1:length(data_fields), S.(data_fields{i})=[data{:,i}]'; end
> >> S
> S = 
>     w    x    y    z
>     1    4    a    7
>     2    5    b    8
>     3    6    c    9
> 
> Bob, I'm not sure why you would have gotten errors and unexpected results.  Is the above what you did?

It may have something to do with the fact that I had the semicolon preceding the transpose ' in the code I posted.
0
Reply xys (1072) 11/22/2009 4:43:04 PM

Thank you to all of the people who have responded to my question.

Peter and Matt,

I broke the first rule of forum question-asking - not posting sample code. Thank you for correcting my error. To expand on the example you so kindly provided, I ran your code and got the same results (a 3x4 dataset), as expected. However, I used it on my own "real" data and data_fields, and got completely different results.

I have replicated the problem with the following code:

  data = {1 4 'ab' 7; 2 5 'cd' 8; 3 6 'ef' 9};
  data_fields = {'w' 'x' 'y' 'z'};
  S = dataset();
  for i=1:length(data_fields)
    S.(data_fields{i})=[data{:,i}]';
  end

This should give a dataset with the same dimensions - 3 rows, 4 columns. However, it actually gives a 6x4 dataset due to the two-character strings being expanded to separate rows.

S = 
    w    x    y    z
    1    4    a    7
    2    5    b    8
    3    6    c    9
    0    0    d    0
    0    0    e    0
    0    0    f    0

Needless to say, this is unexpected behavior and I am unsure whether I do not sufficiently understand cell arrays, or if this is an actual bug in the language. I'm leaning towards a bug because if this behavior was introduced by design, I would think that the integer rows would have been duplicated as the strings in that row were expanded, rather than the dataset being padded with zeros, e.g.

S = 
    w    x    y    z
    1    4    a    7
    1    4    b    7
    2    5    c    8
    2    5    d    8
    3    6    e    9
    3    6    f    9

The approach outlined in Peter's first reply:
S = dataset({data, data_fields{:}});

does work better, but requires an additional for loop to check each dataset variable and change numeric cell vectors to the "right type" with cell2mat while leaving the string variables untouched.
0
Reply Bob 11/23/2009 7:48:03 PM

Bob wrote:

> I have replicated the problem with the following code:
> 
>   data = {1 4 'ab' 7; 2 5 'cd' 8; 3 6 'ef' 9};
>   data_fields = {'w' 'x' 'y' 'z'};
>   S = dataset();
>   for i=1:length(data_fields)
>     S.(data_fields{i})=[data{:,i}]';
>   end
> 
> This should give a dataset with the same dimensions - 3 rows, 4 columns. However, it actually gives a 6x4 dataset due to the two-character strings being expanded to separate rows.
> 
> S = 
>     w    x    y    z
>     1    4    a    7
>     2    5    b    8
>     3    6    c    9
>     0    0    d    0
>     0    0    e    0
>     0    0    f    0
> 
> Needless to say, this is unexpected behavior and I am unsure whether I do not sufficiently understand cell arrays, or if this is an actual bug in the language. I'm leaning towards a bug because if this behavior was introduced by design, I would think that the integer rows would have been duplicated as the strings in that row were expanded, rather than the dataset being padded with zeros, e.g.

Not a bug, and only partially having to do with the finer points of cell arrays.  It's this line here:

     S.(data_fields{i})=[data{:,i}]';

that blindly takes {'ab'; 'cd'; 'ef'}, and turns it into ['abcdef']', a 6x1 column vector.  When you assign that as a new variable into the existing dataset array, you implicitly add rows with default values for the other variables (i.e., zeros).  This is essentially the same thing that would happen if you assigned to the (6,4)th element of a 4x2 double array, by the way.

So the code I wrote was simple, but not flexible enough for your needs.  You'll have to improve on it.  Put in an if test, and no worries.
0
Reply Peter 11/23/2009 8:58:26 PM

Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message <heet1i$5u8$1@fred.mathworks.com>...
> Bob wrote:
> 
> > I have replicated the problem with the following code:
> > 
> >   data = {1 4 'ab' 7; 2 5 'cd' 8; 3 6 'ef' 9};
> >   data_fields = {'w' 'x' 'y' 'z'};
> >   S = dataset();
> >   for i=1:length(data_fields)
> >     S.(data_fields{i})=[data{:,i}]';
> >   end
> > 
> > This should give a dataset with the same dimensions - 3 rows, 4 columns. However, it actually gives a 6x4 dataset due to the two-character strings being expanded to separate rows.
> > 
> > S = 
> >     w    x    y    z
> >     1    4    a    7
> >     2    5    b    8
> >     3    6    c    9
> >     0    0    d    0
> >     0    0    e    0
> >     0    0    f    0
> > 
> > Needless to say, this is unexpected behavior and I am unsure whether I do not sufficiently understand cell arrays, or if this is an actual bug in the language. I'm leaning towards a bug because if this behavior was introduced by design, I would think that the integer rows would have been duplicated as the strings in that row were expanded, rather than the dataset being padded with zeros, e.g.
> 
> Not a bug, and only partially having to do with the finer points of cell arrays.  It's this line here:
> 
>      S.(data_fields{i})=[data{:,i}]';
> 
> that blindly takes {'ab'; 'cd'; 'ef'}, and turns it into ['abcdef']', a 6x1 column vector.  When you assign that as a new variable into the existing dataset array, you implicitly add rows with default values for the other variables (i.e., zeros).  This is essentially the same thing that would happen if you assigned to the (6,4)th element of a 4x2 double array, by the way.
> 
> So the code I wrote was simple, but not flexible enough for your needs.  You'll have to improve on it.  Put in an if test, and no worries.

On the subject of arrays, how would this work if there were more than two cell arrays?
0
Reply CyberFrog 4/19/2010 3:07:05 AM

12 Replies
858 Views

(page loaded in 0.231 seconds)

Similiar Articles:


















7/22/2012 6:18:11 PM


Reply: