understanding a large code base (efficiently)


Anyone out there have tips on how to get a solid understanding of a
large code base (that you did not author) in an efficient manner?
What approaches have worked best for you when faced with maintaining
"someone else's code"?

-Thx

Reply kilik3000 (12) 9/6/2007 5:48:57 PM

kilik3000@gmail.com wrote:
> Anyone out there have tips on how to get a solid understanding of a
> large code base (that you did not author) in an efficient manner?
> What approaches have worked best for you when faced with maintaining
> "someone else's code"?


a) first - learn how to build it in debug mode

b) find out which components (libraries/modules) do what

c) run doxygen over all the code and browse it.

d) pick a few small things to fix and fix them; this forces you to
learn. Don't check them in until you are sure the fixes are right.

Those 4 things will get you halfway up the learning curve fast, but the
other half is just time-consuming. It's questionable whether it's worth
knowing code that well. Code today consists of lots of other libraries
(like libxml2, openssl etc.) and I have found myself needing to dig into
those libraries at times to comprehend how and why they do what they
do.  If I spent the time to understand each one thoroughly, I don't
think I'd be very productive on other parts.
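
For step (b), a rough first pass is simply to see where the code lives. The Python sketch below counts source files under each top-level directory of a checkout. The extension list, and the idea that top-level directories correspond to components, are assumptions for illustration and will not hold for every project.

```python
import os
from collections import Counter

# Extensions treated as source code; an assumption, adjust for your project.
SOURCE_EXTENSIONS = {".c", ".cc", ".cpp", ".h", ".hpp"}


def count_sources_by_component(root):
    """Count source files under each top-level directory of `root`."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Files that sit directly in the root are grouped under "(root)".
        component = "(root)" if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            if os.path.splitext(name)[1].lower() in SOURCE_EXTENSIONS:
                counts[component] += 1
    return counts
```

Calling `count_sources_by_component(".").most_common()` from the top of the tree lists the largest directories first, which is usually a reasonable place to start asking what each one is for.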
Reply Gianni 9/6/2007 11:36:07 PM


On Thu, 6 Sep 2007 17:48:57 UTC,  kilik3000@gmail.com wrote:

> Anyone out there have tips on how to get a solid understanding of a
> large code base (that you did not author) in an efficient manner?
> What approaches have worked best for you when faced with maintaining
> "someone else's code"?

I usually start with an outside-in approach to learning what the
code/application does.  I can usually imagine the internals.

Then I see what components and layers the code has as building
blocks to start creating a mental model.

Fill in the details as you need them.

David
Reply Eagle 9/7/2007 2:02:19 AM

On 6 Sep, 18:48, kilik3...@gmail.com wrote:

> Anyone out there have tips on how to get a solid understanding of a
> large code base (that you did not author) in an efficient manner?
> What approaches have worked best for you when faced with maintaining
> "someone else's code"?

"Understand For C++" (I assume it does other languages) is good but
expensive. I've used it on a largish code base that I was unfamiliar
with (well, twice actually) and it was very helpful.


--
Nick Keighley

Fortran is like a shark, very old and effective.
Tor Rustad

Reply Nick 9/7/2007 10:14:38 AM

On Sep 6, 6:48 pm, kilik3...@gmail.com wrote:
> Anyone out there have tips on how to get a solid understanding of a
> large code base (that you did not author) in an efficient manner?
> What approaches have worked best for you when faced with maintaining
> "someone else's code"?
>
> -Thx

I do this a lot (learn about large and complex new APIs). Not to
maintain them but to work with them and teach about them.

I use a fairly logical methodology to explore the API. It works well
up to a point. It is in fact based on a simple method for assessing
API usability (we published a paper on this at ICCE a few years ago).
The process is demanding and frustrating but it provides a framework
so I know how to tackle the problem. I am working on DirectFB at the
moment so will use that for illustration:

1) Identify the intended application domain of the API. This is not
easy - most APIs are confused as to what their purpose is (for
instance a graphics API may also provide data buffering, which is
useful in graphics but is not really its core domain and so should be
left to some other API that is better designed for that aspect). The
statement of domain should be short and readily understood by someone
who works in that domain - otherwise you have not properly understood.
The statement of the originator of the API is often not helpful. For
instance DirectFB is described as:

"...a thin library that provides developers with hardware graphics
acceleration, input device handling and abstraction, integrated
windowing system with support for translucent windows and multiple
display layers ..."

This is too many things. I can't start to understand an API so broad in
its aims. So I have to simplify, or clarify, or (better) identify what
the API actually does most. DirectFB does three things:
- drawing: creating a pixel map on a single graphic surface
- mixing: combining two or more graphic surfaces
- windows: managing events and inputs for display windows
It also provides other functions - like handling input devices and
data buffers - that can be useful in relation to its three main
purposes. These may be thought of as supporting utilities. That is,
they support or are handy 'add-ons' to the core functionality but we
should not get distracted into thinking about them on their own. So my
starting definition to understand DirectFB is to say it is an API for
'drawing, mixing and window management'. That is a value judgement,
others may disagree, but at this stage my aim is to simplify so that I
can start to understand.

2) Identify functions that address the core domain of application, and
distinguish these from those that support the API or are just stuck on
by accretion. If the API is 'flat' (that is, it does not have a
hierarchy of layers) then I list the API functions and data structures
and place them in three categories:
- application oriented elements address the application directly
- infrastructure oriented elements support the API without adding
application functionality
- accretions are things that seem added on for no good reason - often
they do things that are already done by other APIs and should not be
polluting this one. For instance DirectFB has functions to set up and
handle data buffers - functionality that is much better done by data
buffering APIs than a graphics one. These I leave till last because
they are a distraction and a nuisance.
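
The three-way split in step 2 can be kept as a small catalogue. A minimal Python sketch follows. The function names are hypothetical placeholders, not real DirectFB calls, and the study order (application first, infrastructure next, accretions last) is only one reading of the description above.

```python
from collections import defaultdict
from enum import Enum


class Category(Enum):
    APPLICATION = "application"        # addresses the core domain directly
    INFRASTRUCTURE = "infrastructure"  # supports the API itself
    ACCRETION = "accretion"            # bolted on; study last


# Assumed order in which to study the categories.
STUDY_ORDER = [Category.APPLICATION, Category.INFRASTRUCTURE, Category.ACCRETION]


def group_by_category(catalogue):
    """Turn a {name: Category} mapping into {Category: sorted names}."""
    groups = defaultdict(list)
    for name, category in catalogue.items():
        groups[category].append(name)
    return {category: sorted(names) for category, names in groups.items()}


def study_order(catalogue):
    """List every name, core-domain elements first and accretions last."""
    groups = group_by_category(catalogue)
    return [name for cat in STUDY_ORDER for name in groups.get(cat, [])]


# Hypothetical entries for illustration only.
catalogue = {
    "draw_rectangle": Category.APPLICATION,
    "blend_surfaces": Category.APPLICATION,
    "get_surface_size": Category.INFRASTRUCTURE,
    "create_data_buffer": Category.ACCRETION,
}

reading_list = study_order(catalogue)
```

Because each element carries exactly one category, reclassifying it later (which happens often while the domain statement is still settling) is a one-line change.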

3) Identify layers and groups of functionality, and relate these to
(1) the statement of domain of application. If the API is layered
(that is, has a hierarchy of functions) then I have to identify the
layers. My use of the word 'layered' here is broad - I include for
instance interfaces (collections of functions that address a single
object) as a higher layer, and the functions themselves as a lower
layer. I then treat each layer in the same way - identifying its sub-
domain of application, or the objects that it addresses, and analysing
each of these as a sub-API. In this way I can keep each piece of
analysis tractable. For instance DirectFB has interfaces - for objects
like Surfaces and DisplayLayers and Windows - and I can identify the
sub-domain of these - for example DisplayLayers obviously deal with
the objects 'DisplayLayers'. This gets very difficult but is
worthwhile. The API is usually organized in a horrible way. For
instance DirectFB has a 'Surface' interface that contains all the
functions that address a Surface (a container for a pixel map). But
these include functions with different aims - drawing, management of
the Surface properties, mixing with other Surfaces, etc. Here I have
to go back to my definition and work out what the sub-divisions are,
related to that definition of the domain. Since I say DirectFB is for
'drawing, mixing and window management' then I may decide to sub-
divide the single huge Surface interface collection of functions into
those that are for drawing (application), those that are for mixing
(application), and those that manage the Surface (infrastructure). As
I do this I often have to re-assess my definition but I take the time
to do that so that I keep a good logical map that is synchronized in
my mind (or hopefully on paper..) at overview and at detailed levels.
Try to make your groups of functions small (20 functions at most) so
you can hope to understand them in depth: very large groups tend to
fill my mind with too much detail.

4) Start at the highest layer (e.g. interfaces rather than individual
functions). Identify 'objects' (loose definition meaning 'things that
the API addresses and that are meaningful to the user of the API') and
clarify what are the things that can be done to them or by them. Also
clarify the inter-relations. With luck, there will be few interactions
between the objects and many within the objects. At this stage, fight
the despair - most APIs are horrible, they are designed from an
internal perspective that has everything to do with implementation and
little to do with use. I find that having someone to rant to about the
obtuseness and lack of overall logical clarity of programmers helps at
this stage. (Don't get me wrong, programmers are amazing, brilliant
people - but they are often not strong at the broad sweep of logical
consistency).

5) Work down through the layers, checking the purpose and effect of
each function and data structure in detail. You can go as far as you
like. I usually check source code for all but the most obvious of
functions, unless the documentation is comprehensive and I trust it. I
do not trust doxygen documentation because programmers usually do not
document clearly what their intention was in a way that makes sense to
the user: and often comments are not updated to reflect actual code.
It is horrible to learn an API from what its documentation says it
does, only to find that it in fact does something different.

At this stage you should build up a list of all the core API
functions, objects and data structures. If this pre-exists (it would
be called 'documentation'..) then instead of writing it you can simply
check it off. If you wrote it then it will be organized according to
the simply stated domain of intended application, subdivided into
smaller logical groups that share a common but smaller purpose. If you
work from pre-existing comprehensive documentation then you can either
stick with their view (which may not be logically consistent or
simple) or do it my way anyhow. This is the basis for your
understanding of the API.

While you are doing this you should be constantly doing two other
things:

a) Identify and resolve logical inconsistencies, duplications,
distractions and confusions. For instance, DirectFB has functions with
the same name that, when called on different objects where one would
logically expect identical behavior, in fact do something quite
different. You need to identify these and note them - I call them the
'gotchas' and list them as potential traps for the unwary. The idea is
to avoid assuming that a function does something because that is the
obvious, or logical, or consistent, thing for it to do - you have to
work out what it actually does which is very often neither what you
thought nor what it should. Duplications are where the API provides
more than one way to do a simple job: identify the alternate routes
and if possible find the one that is most commonly used or is most
logical (the 'idiom' of the API - that is, how people actually use
it). Distractions are functions that do not address the core domain.
For instance DirectFB's DataBuffers are useful but they duplicate
functionality that is provided by other APIs, so they tend to distract
from understanding more vital core functionality - do not get
distracted by these details and accretions. Confusions are where one
function or group of functions unexpectedly affects the effect of
another. For instance DirectFB has a horrid concept called the
'primary Surface' - where choices you make in setting up some overall
system parameters totally change what you actually do when you create
a Surface. Because objects A and B are unrelated (at first sight) the
programmer learning an API does not expect that function fa,
addressing objects of type A, then affects what function fb does when
addressing objects of type B. There is a lot of thinking and ranting
at this stage. I find that once I can think of reasons why the API
designer might have made this seemingly bizarre and contrary choice,
and can at least sympathize with that design choice, then I am
beginning to understand the API. You don't have to agree with the
design (I rarely do) but you must understand it.

b) Test your assumptions by writing or at least checking against
actual code. Where necessary, check the source code but do not assume
that is correct - it may be wrong, so you often must make a judgement
if you cannot verify with someone who has authority to make a ruling.
If there are experts, then check with them also any difficult
judgements - but again, do not assume they are correct, they may have
become expert at making mistakes that are so well accepted as to
become traditions.
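
One way to make point (b) routine is to write each assumption down next to a tiny probe that checks it. The Python sketch below is an illustration under stated assumptions: the helper `check_assumption` is a hypothetical name, and Python's own `str.split` stands in for the API under study (it is not DirectFB), chosen because two calls that look equivalent behave differently.

```python
def check_assumption(description, assumed, observe):
    """Call observe() and record whether the result matches what was assumed."""
    observed = observe()
    return {
        "description": description,
        "assumed": assumed,
        "observed": observed,
        "holds": observed == assumed,
    }


text = "a  b"  # two spaces between the letters

results = [
    check_assumption(
        "split() collapses runs of whitespace",
        ["a", "b"],
        lambda: text.split(),
    ),
    check_assumption(
        "split(' ') behaves the same as split()",
        ["a", "b"],
        lambda: text.split(" "),
    ),
]

# Assumptions that failed are candidates for the 'gotchas' list.
gotchas = [r for r in results if not r["holds"]]
```

The second assumption fails because an explicit separator keeps the empty string between the two spaces, which is exactly the sort of thing that only shows up when you run the code.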

As I do this, and purely as a personal way of working, I maintain a
list of every function, object and data structure in the API. As I
visit it I mark it as such, and when I finalize a decision on
interpretation I mark it again. That way I can make sure not to miss
anything.
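
The two-mark list described above could be kept on paper, in a spreadsheet, or in a few lines of code. A minimal Python sketch, assuming three states and hypothetical element names (none of this comes from DirectFB):

```python
from enum import Enum


class Status(Enum):
    UNVISITED = 0
    VISITED = 1     # looked at, interpretation still open
    FINALIZED = 2   # interpretation decided


class Checklist:
    """Track every function, object and data structure of an API."""

    def __init__(self, names):
        self._status = {name: Status.UNVISITED for name in names}

    def visit(self, name):
        # A first look never downgrades an element that is already finalized.
        if self._status[name] is Status.UNVISITED:
            self._status[name] = Status.VISITED

    def finalize(self, name):
        self._status[name] = Status.FINALIZED

    def reopen(self, name):
        # A mistaken assumption was found: the decision is open again.
        self._status[name] = Status.VISITED

    def remaining(self):
        """Names whose interpretation is not yet finalized, sorted."""
        return sorted(n for n, s in self._status.items()
                      if s is not Status.FINALIZED)


# Hypothetical element names for illustration only.
checklist = Checklist(["create_surface", "draw_rectangle", "SurfaceDescription"])
checklist.visit("create_surface")
checklist.finalize("draw_rectangle")
todo = checklist.remaining()
```

The `reopen` step is there because a finalized decision often has to be revisited once a wrong assumption turns up elsewhere.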

Also, note that I very often have to go back and almost start again
because it is so easy to make a mistaken assumption. The worst is, I
find, that often programmers use an API in a way that is not really
sensible - because they tend to follow previous examples and so any
misunderstandings propagate themselves.

Good luck,

Chris
===============
Chris Bore
BORES Signal Processing
www.bores.com



Reply Chris 9/7/2007 11:34:17 AM

On Sep 6, 10:48 am, kilik3...@gmail.com wrote:
> Anyone out there have tips on how to get a solid understanding of a
> large code base (that you did not author) in an efficient manner?
> What approaches have worked best for you when faced with maintaining
> "someone else's code"?

Rational Rose is nice, but it costs an arm and a leg.
There is a Red Hat tool that will do a good job (can't remember the
name right now).
I like Doxygen, but it produces a huge volume to wade through.

The burning question here is: "Why wasn't the code base documented in
the first place?"
That is a clear indication of shoddy workmanship.

Reply user923005 9/7/2007 11:17:54 PM
