> On Fri, 2007-09-02 at 08:59 -0500, Weikuan Yu wrote:
> > Hi,
>
> Hi,
>
> > 2) Inaccurate reports due to any drift in time
> >
> > -- Time drift is an annoying problem of IOR. IOR checks on the skew
> > of timestamps from each process. But it does not calibrate the timer at
> > the beginning. So it spews numerous warnings on systems with big drift
> > across nodes. In addition, it reports wrong numbers for IO rate. Added
> > recalibration still makes your numbers more accurate, even if you did not
> > notice you have a problem before.
>
> I am encountering this issue and recalled this posting of yours.
> After some investigation, it appears that unsynchronized clocks between
> nodes are not the only problem.
>
> For example, I have a cluster of 127 nodes. I use NTP to keep that
> cluster in sync, as can be seen:
>
> $ pdsh -S -w o[1-2,4-128] date | dshbak -c
> ----------------
> o[98-128]
> ----------------
> Mon Mar 19 22:25:54 GMT 2007
> ----------------
> o[66-97]
> ----------------
> Mon Mar 19 22:25:53 GMT 2007
> ----------------
> o[1-2,5,14,19,34-65]
> ----------------
> Mon Mar 19 22:25:52 GMT 2007
> ----------------
> o[4,6-13,15-18,20-33]
> ----------------
> Mon Mar 19 22:25:51 GMT 2007
>
> As you can see, between any two nodes we have at most 3 seconds of
> drift (22:25:51 to 22:25:54), yet MPI registers a much bigger skew
> between nodes:
>
> $ mpirun -np 127 -machinefile machfile -nolocal mpi_time | sort -k 1 -n -k 6
> pass1: on o128 time is 1.320312
> pass1: on o127 time is 2.187500
> pass1: on o126 time is 3.273438
> pass1: on o125 time is 4.140625
> pass1: on o124 time is 5.222656
> ...
> pass1: on o6 time is 120.476562
> pass1: on o5 time is 121.542969
> pass1: on o4 time is 122.625000
> pass1: on o2 time is 123.691406
> pass1: on o1 time is 124.488281
>
> And this deviation in pass1 correlates quite closely to the amount of
> time it takes mpich1 to get all of the nodes up and running with the MPI
> program:
>
> o1: Mar 19 21:40:43 orion1 in.rshd[10105]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 34415 \-p4amslave \-p4yourname orion1 \-p4rmrank 1'
> o2: Mar 19 22:56:15 orion2 in.rshd[9942]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion2 \-p4rmrank 1'
> o4: Mar 19 22:56:16 orion4 in.rshd[9765]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion4 \-p4rmrank 2'
> o5: Mar 19 22:56:17 orion5 in.rshd[9752]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion5 \-p4rmrank 3'
> o6: Mar 19 22:56:18 orion6 in.rshd[9731]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion6 \-p4rmrank 4'
> ...
> o124: Mar 19 22:58:13 orion124 in.rshd[9193]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion124 \-p4rmrank 122'
> o125: Mar 19 22:58:14 orion125 in.rshd[9193]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion125 \-p4rmrank 123'
> o126: Mar 19 22:58:15 orion126 in.rshd[9187]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion126 \-p4rmrank 124'
> o127: Mar 19 22:58:16 orion127 in.rshd[9202]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion127 \-p4rmrank 125'
> o128: Mar 19 22:58:17 orion128 in.rshd[9186]: root@o1 as root:
>     cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion128 \-p4rmrank 126'
>
> So the time deviation is counting the 2 or so minutes it takes to get
> 127 nodes all up and running the MPI program.
>
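As a rough cross-check of that correlation, the sketch below uses only numbers quoted above: the pass1 values reported on o2 and o128, and the in.rshd launch times logged for the same two nodes. The function names are made up for illustration and are not part of IOR or mpi_time.c.

```c
#include <stdio.h>

/* Convert an HH:MM:SS wall-clock time to seconds since midnight. */
long hms_to_seconds(int h, int m, int s)
{
    return h * 3600L + m * 60L + s;
}

/* Print the pass1 gap and the rshd launch gap between o2 and o128. */
void print_gaps(void)
{
    /* pass1 readings quoted above: o2 = 123.691406, o128 = 1.320312 */
    double mpi_gap = 123.691406 - 1.320312;

    /* in.rshd log times quoted above: o2 at 22:56:15, o128 at 22:58:17 */
    long rsh_gap = hms_to_seconds(22, 58, 17) - hms_to_seconds(22, 56, 15);

    printf("pass1 gap (o2 vs o128):       %.6f s\n", mpi_gap);
    printf("rshd launch gap (o2 vs o128): %ld s\n", rsh_gap);
}
```

Calling print_gaps() from a main() gives a pass1 gap of 122.371094 s against a launch gap of 122 s, which is the close match described above.
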
> If I (much as you did with your patch) take the initial timestamp and
> correct the time returned by GetTimeStamp() with it, I get a much better
> value from it:
>
> pass2: on o2 time is 0.000000
> pass2: on o67 time is 0.000000
> pass2: on o99 time is 0.000000
> pass2: on o35 time is 0.003906
> pass2: on o100 time is 0.093750
> ...
> pass2: on o91 time is 0.402344
> pass2: on o93 time is 0.402344
> pass2: on o95 time is 0.402344
> pass2: on o97 time is 0.402344
> pass2: on o9 time is 0.402344
>
> Since I have not really audited IOR to the point of understanding all of
> the timekeeping in it, I wonder, is this algorithm (correcting for the
> difference in startup times of the remote processes) incorrect?
Hi Brian,

I handle most of the development/maintenance of IOR, and I'm not sure why the
"pass1" output seems to be showing serialized behavior. At a minimum, the
source of mpi_time.c would need a call to TimeDeviation() in IOR.c. This
function calculates the time offset for each rank against rank 0. Without
the call, there is no adjustment for clock skew -- or rather, the adjustment
is always by 0 for all tasks.

The latest version of IOR contains Weikuan's patches (massaged to remove the
redundancy you noted) if you want to try working with that. Let me know if
you'd like a tarball of this version (the ftp website is not available right
now). Also, if you're interested in following any of this up with me, I'd be
glad to help -- but, I'm not on the lustre-discuss list, so please cc me.

Thanks,
-- Bill.
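
To illustrate the kind of adjustment described above, here is a minimal sketch of computing a per-rank offset against rank 0 and applying it to later readings. This is not the TimeDeviation() code from IOR.c; the function names are hypothetical, and the per-rank timestamps are passed in as a plain array so the sketch stays self-contained. In an MPI program each rank would instead take its own timestamp after a barrier and obtain rank 0's value with a collective such as MPI_Bcast().

```c
#include <stddef.h>

/*
 * Hypothetical helper: given the raw timestamps that each rank read
 * right after a common barrier, store each rank's offset from rank 0.
 * offsets[0] is always 0.
 */
void compute_offsets(const double *raw, double *offsets, size_t nranks)
{
    size_t i;

    for (i = 0; i < nranks; i++)
        offsets[i] = raw[i] - raw[0];
}

/* Map a later raw reading on one rank onto rank 0's timeline. */
double adjust_timestamp(double raw_now, double offset)
{
    return raw_now - offset;
}
```

If the offsets are never computed, every rank effectively uses an offset of 0 and its raw readings pass through unchanged, which matches the "adjustment is always by 0" case.
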
> The source for mpi_time.c:
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv) {
>
>     double initial_timestamp, timestamp;
>     int namelen;
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>     int numTasksWorld = 0;
>     int rank = 0;
>
>     /* start the MPI code */
>     MPI_CHECK(MPI_Init(&argc, &argv), "cannot initialize MPI");
>     MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD, &numTasksWorld),
>               "cannot get number of tasks");
>     MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &rank), "cannot get rank");
>     MPI_CHECK(MPI_Get_processor_name(processor_name, &namelen),
>               "cannot get processor name");
>
>     /* pass1: raw timestamp taken right after the first barrier */
>     MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD), "barrier error");
>     initial_timestamp = GetTimeStamp();
>     fprintf(stdout, "pass1: on %s time is %f\n", processor_name,
>             initial_timestamp);
>
>     /* pass2: timestamp after a second barrier, relative to pass1 */
>     MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD), "barrier error");
>     timestamp = GetTimeStamp();
>     fprintf(stdout, "pass2: on %s time is %f\n", processor_name,
>             timestamp - initial_timestamp);
>
>     MPI_CHECK(MPI_Finalize(), "cannot finalize MPI");
>     return 0;
>
> }
>
> (which uses MPI_CHECK() and GetTimeStamp() from IOR.c)
>
> > Let me know if you may have some comments.
>
> The only thing I'd say is that your implementation of InitTimeStamp()
> seems almost redundant given GetTimeStamp(). Why not pass correction as
> an argument to GetTimeStamp() and set it to 0 to initialize and the
> init_timeval thereafter? Too messy? Perhaps.
>
> init_timeval = GetTimeStamp(0) to initialize and
> GetTimeStamp(init_timeval) thereafter
>
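The calling convention suggested above could look roughly like the following. This is a sketch under assumptions, not the GetTimeStamp() from IOR.c: the name get_time_stamp() is hypothetical, and gettimeofday() is assumed as the underlying clock.

```c
#include <stdio.h>
#include <sys/time.h>

/*
 * Hypothetical timestamp helper: returns wall-clock seconds minus the
 * caller-supplied correction. Pass 0 to get the uncorrected value.
 * Returns -1.0 if the clock cannot be read.
 */
double get_time_stamp(double correction)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0) {
        perror("gettimeofday");
        return -1.0;
    }
    return ((double)tv.tv_sec + (double)tv.tv_usec / 1e6) - correction;
}
```

A caller would take init_timeval = get_time_stamp(0) once, after a barrier, and call get_time_stamp(init_timeval) from then on, so every later value is relative to that first reading.
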
> I wonder in fact if GetTimeStamp() can't do this initialization and
> always account for the init_timeval? init_timeval could be stored as a
> static in GetTimeStamp() along with an "initialized" static boolean so
> that the first time through it's initialized. It's tempting to use a 0 in
> init_timeval as the "not initialized" flag but it could legitimately be
> 0 I think.
>
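A self-initializing version along the lines described above might be sketched as follows. Again this is an illustration only: get_elapsed_time_stamp() is a hypothetical name, gettimeofday() is assumed as the clock, and the static state makes the function unsuitable for concurrent callers.

```c
#include <stdio.h>
#include <sys/time.h>

/*
 * Hypothetical self-initializing timestamp: the first call records the
 * baseline and returns 0.0; later calls return seconds since that call.
 * Returns -1.0 if the clock cannot be read.
 */
double get_elapsed_time_stamp(void)
{
    /* Separate flag, because a baseline of 0 could be a legitimate value. */
    static int initialized = 0;
    static double init_timeval = 0.0;
    struct timeval tv;
    double now;

    if (gettimeofday(&tv, NULL) != 0) {
        perror("gettimeofday");
        return -1.0;
    }
    now = (double)tv.tv_sec + (double)tv.tv_usec / 1e6;

    if (!initialized) {
        init_timeval = now;
        initialized = 1;
    }
    return now - init_timeval;
}
```

The trade-off is that the baseline is fixed by whichever call happens first, so the first call would need to sit right after the barrier on every rank for the values to be comparable.
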
> b.
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss