Hi,

I could not find an IOR list. I figured that a lot of folks on this list are using IOR and might be interested. Here is a patch to eliminate two oversights in IOR.

1) Conflict in the usage of FILE_DELIMITER ':'
2) Inaccurate reports due to any drift in time

Note:
-- ':' has long been taken by ROMIO for specifying the file system type. Replacing it with '@' solves the problem, and file system specification is possible again.
-- Time drift is an annoying problem of IOR. IOR checks the skew of timestamps from each process, but it does not calibrate the timer at the beginning. So it spews numerous warnings on systems with big drift across nodes. In addition, it reports wrong numbers for the IO rate. The added recalibration makes your numbers more accurate, even if you did not notice you had a problem before.

Let me know if you have any comments.

Thanks,
Weikuan
++++++++++++++++++++++++++++
Weikuan Yu <+> 1-865-574-7990
http://www.csm.ornl.gov/~wyu/
-------------- next part --------------
diff -pruN IOR-2.9.1/src/C/IOR.c IOR-2.9.1.new/src/C/IOR.c
--- IOR-2.9.1/src/C/IOR.c  2006-09-26 16:13:33.000000000 -0400
+++ IOR-2.9.1.new/src/C/IOR.c  2007-02-08 22:53:24.000000000 -0500
@@ -105,6 +105,8 @@ main(int argc,
     /*MPI_CHECK(MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN),
                 "cannot set errhandler");*/
 
+    InitTimeStamp();
+
     /* setup tests before verifying test validity */
     tests = SetupTests(argc, argv);
     verbose = tests->testParameters.verbose;
@@ -835,6 +837,32 @@ GetTestFileName(char * testFileName, IOR
     }
 } /* GetTestFileName() */
 
+double init_timeval;
+
+double
+InitTimeStamp(void)
+{
+    double timeVal;
+#ifdef _NO_MPI_TIMER
+    struct timeval timer;
+
+    if (gettimeofday(&timer, (struct timezone *)NULL) != 0)
+        ERR("cannot use gettimeofday()");
+    timeVal = (double)timer.tv_sec + ((double)timer.tv_usec/1000000);
+#else /* not _NO_MPI_TIMER */
+    timeVal = MPI_Wtime(); /* no MPI_CHECK(), just check return value */
+    if (timeVal < 0) ERR("cannot use MPI_Wtime()");
+#endif /* _NO_MPI_TIMER */
+
+    /* wall_clock_delta is difference from root node's time.  if significant
+       wall clock deviation, this is necessary correction; else it's set to
+       zero. */
+    timeVal -= wall_clock_delta;
+    init_timeval = timeVal;
+
+    return(timeVal);
+} /* initTimeStamp() */
+
 
 /******************************************************************************/
 /*
@@ -862,7 +890,7 @@ GetTimeStamp(void)
        zero. */
     timeVal -= wall_clock_delta;
 
-    return(timeVal);
+    return(timeVal-init_timeval);
 } /* GetTimeStamp() */
 
 
diff -pruN IOR-2.9.1/src/C/iordef.h IOR-2.9.1.new/src/C/iordef.h
--- IOR-2.9.1/src/C/iordef.h  2006-10-02 13:34:54.000000000 -0400
+++ IOR-2.9.1.new/src/C/iordef.h  2007-02-08 22:02:39.000000000 -0500
@@ -84,7 +84,7 @@ extern int numTasks,
 
 #define WC_OL_THRESHOLD 5 /* outlier threshold in sec */
 #define DELIMITERS " \t\r\n=" /* ReadScript() */
-#define FILENAME_DELIMITER ':' /* ParseFileName() */
+#define FILENAME_DELIMITER '@' /* ParseFileName() */
 
 /* MACROs for debugging */
 #define HERE fprintf(stdout, "** LINE %d (TASK=%d) **\n", \
diff -pruN IOR-2.9.1/src/C/IOR.h IOR-2.9.1.new/src/C/IOR.h
--- IOR-2.9.1/src/C/IOR.h  2006-07-11 20:04:46.000000000 -0400
+++ IOR-2.9.1.new/src/C/IOR.h  2007-02-08 23:00:00.000000000 -0500
@@ -55,6 +55,7 @@
 void FillBuffer (void *, unsigned long long, int);
 void GetPlatformName (char *);
 void GetTestFileName (char *, IOR_param_t *);
+double InitTimeStamp (void);
 double GetTimeStamp (void);
 char * HumanReadable (IOR_offset_t, int);
 IOR_offset_t IOR_GetFileSize_POSIX (IOR_param_t *, MPI_Comm, char *);
On Fri, 2007-02-09 at 08:59 -0500, Weikuan Yu wrote:

Hi. I'd like to briefly explore this second problem...

> 2) Inaccurate reports due to any drift in time

> -- Time drift is an annoying problem of IOR.

What do you mean exactly by time drift? "Drift" to me means that the difference between the machines' clocks actually grows, not that the clocks are merely unsynchronized by a constant offset. If your clocks really are growing apart, do you know why? Are you using ntp to (try to) keep the clocks in sync?

What if you synchronize the clocks (e.g. with ntpdate or rdate) of all of the nodes right before the IOR run? Are they still all in sync after the run ends?

> IOR checks on the skew
> of timestamps from each process. But it does not calibrate the timer at
> the beginning. So it spews numerous warning on systems with big drift
> across nodes.

What warnings were you seeing?

What version of Lustre are you using?

b.
Sounds like you were the lucky ones...

I do not think your definition of "drift" is very different from my understanding. That said, the patch is meant to remove IOR's reliance on good time synchronization, such as ntp. It has no direct link to Lustre; it just so happens that I noticed this list has had some reports using IOR.

In addition, the granularity of ntp is mostly at the millisecond level, assuming your nodes sync frequently enough and are always able to do so. The patch brings the clock skew down to the best that a given MPI implementation can achieve, i.e. the granularity of a barrier (inherent in Init()).

Weikuan

Brian J. Murrell wrote:
> On Fri, 2007-02-09 at 08:59 -0500, Weikuan Yu wrote:
>
> Hi.
>
> I'd like to briefly explore this second problem...
>
>> 2) Inaccurate reports due to any drift in time
>
>> -- Time drift is an annoying problem of IOR.
>
> What do you mean exactly by time drift? "Drift" to me means that the
> difference in the clocks of machines actually grows, not that they are
> simply just not synchronized yet constant. If you have drift in terms
> of clocks actually growing apart, do you know why this is? Are you
> using ntp to (try) to keep the clocks in sync?
>
> What if you synchronize the clocks (i.e. with ntpdate or rdate, etc.) of
> all of the nodes right before the IOR run? Are they still all in sync
> after the run ends?
>
>> IOR checks on the skew
>> of timestamps from each process. But it does not calibrate the timer at
>> the beginning. So it spews numerous warning on systems with big drift
>> across nodes.
>
> What warnings were you seeing?
>
> What version of Lustre are you using?
>
> b.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Hi Weikuan,

I handle most of the IOR maintenance, and I agree the FILE_DELIMITER needs to be changed from ':' to something else. The '@' change seems a good fit for IOR users who've had problems with ':' in the past, so I'll update the source to reflect this new default.

As for the time drift, I think the time drift between nodes should be calibrated so all nodes see the same starting time. Currently, IOR skips recalibration if the skew between the earliest and latest times is less than 5 seconds. If the skew is too egregious (> 5 seconds), then all tasks use the root task's time. I agree that a better approach would be to have IOR adjust for time drift without regard to the wallclock outlier threshold. I'll make this change.

Also, I am going to remove the "WARNING: Time deviation . . ." message. (I agree it can be annoying.) I will leave the "Wall clock deviation: X.Y sec" message, however. There are cases where we need to see how badly the nodes are out of synchronization timewise. I don't think this is too intrusive to the output.

If you have any follow-up comments to the lustre-discuss list on these changes or IOR in general, please cc me as well.

Thanks,

-- Bill.

> Date: Fri, 09 Feb 2007 08:59:08 -0500
> From: Weikuan Yu <weikuan.yu@gmail.com>
> To: lustre <lustre-discuss@clusterfs.com>
> Subject: [Lustre-discuss] IOR patch
>
> Hi,
>
> I could not find a IOR list. Figured that there are a lot of folks on this
> list are using IOR, who might be interested. Here is a patch to eliminate
> two oversights of IOR.
>
> 1) Conflict in the usage of FILE_DELIMITER ':'
> 2) Inaccurate reports due to any drift in time
>
> Note:
> -- ':' is already taken by ROMIO for long for specifying file system
> type. By replacing it with '@', the problem is solved. File system
> specification is again possible.
> -- Time drift is an annoying problem of IOR. IOR checks on the skew
> of timestamps from each process. But it does not calibrate the timer at
> the beginning. So it spews numerous warning on systems with big drift
> across nodes. In addition, it reports wrong numbers for IO rate. Added
> recalibration still makes your numbers more accurate, even if you did not
> notice you have a problem before.
>
> Let me know if you may have some comments.
>
> Thanks,
> Weikuan
> ++++++++++++++++++++++++++++
> Weikuan Yu <+> 1-865-574-7990
> http://www.csm.ornl.gov/~wyu/
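The two calibration policies Bill contrasts above can be sketched in plain C. This is only an illustration under stated assumptions: the per-task timestamps are taken to be already gathered into an array after a barrier, task 0 is taken to be the root, and the function names delta_with_threshold() and delta_always() are hypothetical, not IOR functions. Only WC_OL_THRESHOLD and its value come from iordef.h as shown in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define WC_OL_THRESHOLD 5 /* outlier threshold in sec, as in iordef.h */

/*
 * Behaviour as described for current IOR: a task's correction is its
 * difference from the root task's time, but only when the spread between
 * the earliest and latest timestamps exceeds the threshold; otherwise 0.
 * Assumption: stamps[0] belongs to the root task.
 */
double delta_with_threshold(const double *stamps, size_t n, size_t rank)
{
    assert(n > 0 && rank < n);

    double min = stamps[0];
    double max = stamps[0];
    for (size_t i = 1; i < n; i++) {
        if (stamps[i] < min) min = stamps[i];
        if (stamps[i] > max) max = stamps[i];
    }

    if (max - min > WC_OL_THRESHOLD)
        return stamps[rank] - stamps[0];
    return 0.0;
}

/*
 * Proposed behaviour: always correct against the root task's time,
 * without regard to the outlier threshold.
 */
double delta_always(const double *stamps, size_t n, size_t rank)
{
    assert(n > 0 && rank < n);
    return stamps[rank] - stamps[0];
}
```

With timestamps of 100.0, 102.0 and 103.0 seconds, the thresholded version leaves the third task uncorrected because the 3 second spread is under the threshold, while the unconditional version subtracts the full 3 seconds.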
yujian
2007-Mar-05 01:08 UTC
[Lustre-discuss] Data type issue of tmpOffset in 2.9-series IOR
Hello Bill,

Recently, when we ran IOR (version 2.9.0) in MPIIO mode to test the performance of Lustre, we encountered the following error:

** error **
ERROR in aiori-MPIIO.c (line 303): cannot access explicit, noncollective.
MPI Invalid argument
** exiting **

After looking into the IOR code, we found that the offset parameter passed to the MPI_File_read_at() API was negative, and that this was due to the incorrect data type of the tmpOffset variable in the WriteOrRead() function in IOR.c. It's declared as "int", and we think it should be "IOR_offset_t". Could you please take a look at this issue?

BTW: this issue also exists in the other 2.9-series IOR releases.

Thanks.

--
Best regards,
Yu Jian
Bill Loewe
2007-Mar-05 22:43 UTC
[Lustre-discuss] Re: Data type issue of tmpOffset in 2.9-series IOR
Hi,

Thanks for catching this. I agree that IOR_offset_t (long long int) should be used, not the 32-bit int in the source. I'll make that change.

Thanks,

-- Bill.

On Monday 05 March 2007 00:08, yujian wrote:
> Hello Bill,
>
> Recently, when we ran IOR (version 2.9.0) in the MPIIO mode to test the
> performance of Lustre, we encountered the following error:
>
> ** error **
> ERROR in aiori-MPIIO.c (line 303): cannot access explicit, noncollective.
> MPI Invalid argument
> ** exiting **
>
> After looking into the IOR codes, we found that the offset parameter
> passed to MPI_File_read_at() api was negative, and this was due to the
> incorrect data type of tmpOffset variable in WriteOrRead() function in
> IOR.c. It's declared as "int", and we think it should be "IOR_offset_t".
> Could you please take a look at this issue?
>
> BTW: this issue also exists in other 2.9-series IORs.
>
> Thanks.
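To make the tmpOffset problem concrete, here is a small sketch of why a byte offset outgrows a 32-bit int. It does not reproduce WriteOrRead(); the helpers transfer_offset() and fits_in_int() are hypothetical names used only for illustration, and the typedef follows Bill's statement that IOR_offset_t is a long long int.

```c
#include <assert.h>
#include <limits.h>

typedef long long IOR_offset_t; /* per the reply above: long long int */

/* Byte offset of the transfer with the given index, computed in 64 bits. */
IOR_offset_t transfer_offset(IOR_offset_t transfer_size, IOR_offset_t index)
{
    assert(transfer_size >= 0 && index >= 0);
    return transfer_size * index;
}

/* Returns 1 if the offset can be stored in a plain int without loss, else 0. */
int fits_in_int(IOR_offset_t offset)
{
    return offset >= INT_MIN && offset <= INT_MAX;
}
```

With 1 MiB transfers, the offset of transfer number 2048 is 2147483648 bytes, one more than the largest value a 32-bit int can hold. Storing such a value in an int gives an implementation-defined result, which on common platforms is a negative number, and that would be consistent with the negative offset reported above.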
On Fri, 2007-02-09 at 08:59 -0500, Weikuan Yu wrote:
> Hi,

Hi,

> 2) Inaccurate reports due to any drift in time

> -- Time drift is an annoying problem of IOR. IOR checks on the skew
> of timestamps from each process. But it does not calibrate the timer at
> the beginning. So it spews numerous warning on systems with big drift
> across nodes. In addition, it reports wrong numbers for IO rate. Added
> recalibration still makes your numbers more accurate, even if you did not
> notice you have a problem before.

We are encountering this issue, and I recalled this posting of yours. After some investigation, I found that unsynchronized clocks between nodes are not the only problem. For example, I have a cluster of 127 nodes. I use ntp to keep that cluster in sync, as can be seen:

$ pdsh -S -w o[1-2,4-128] date | dshbak -c
----------------
o[98-128]
----------------
Mon Mar 19 22:25:54 GMT 2007
----------------
o[66-97]
----------------
Mon Mar 19 22:25:53 GMT 2007
----------------
o[1-2,5,14,19,34-65]
----------------
Mon Mar 19 22:25:52 GMT 2007
----------------
o[4,6-13,15-18,20-33]
----------------
Mon Mar 19 22:25:51 GMT 2007

As you can see, at most between any two nodes we only have 4 seconds of drift, yet MPI registers a much, much bigger skew between nodes:

$ mpirun -np 127 -machinefile machfile -nolocal mpi_time | sort -k 1 -n -k 6
pass1: on o128 time is 1.320312
pass1: on o127 time is 2.187500
pass1: on o126 time is 3.273438
pass1: on o125 time is 4.140625
pass1: on o124 time is 5.222656
...
pass1: on o6 time is 120.476562
pass1: on o5 time is 121.542969
pass1: on o4 time is 122.625000
pass1: on o2 time is 123.691406
pass1: on o1 time is 124.488281

And this deviation in pass1 correlates quite closely with the amount of time it takes mpich1 to get all of the nodes up and running with the MPI program:

o1: Mar 19 21:40:43 orion1 in.rshd[10105]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 34415 \-p4amslave \-p4yourname orion1 \-p4rmrank 1'
o2: Mar 19 22:56:15 orion2 in.rshd[9942]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion2 \-p4rmrank 1'
o4: Mar 19 22:56:16 orion4 in.rshd[9765]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion4 \-p4rmrank 2'
o5: Mar 19 22:56:17 orion5 in.rshd[9752]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion5 \-p4rmrank 3'
o6: Mar 19 22:56:18 orion6 in.rshd[9731]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion6 \-p4rmrank 4'
...
o124: Mar 19 22:58:13 orion124 in.rshd[9193]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion124 \-p4rmrank 122'
o125: Mar 19 22:58:14 orion125 in.rshd[9193]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion125 \-p4rmrank 123'
o126: Mar 19 22:58:15 orion126 in.rshd[9187]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion126 \-p4rmrank 124'
o127: Mar 19 22:58:16 orion127 in.rshd[9202]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion127 \-p4rmrank 125'
o128: Mar 19 22:58:17 orion128 in.rshd[9186]: root@o1 as root: cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion128 \-p4rmrank 126'

So the time deviation is counting the 2 or so minutes it takes to get all 127 nodes up and running the MPI program.
If I (much as you did with your patch) take the initial timestamp and use it to correct the time returned by GetTimeStamp(), I get a much better value:

pass2: on o2 time is 0.000000
pass2: on o67 time is 0.000000
pass2: on o99 time is 0.000000
pass2: on o35 time is 0.003906
pass2: on o100 time is 0.093750
...
pass2: on o91 time is 0.402344
pass2: on o93 time is 0.402344
pass2: on o95 time is 0.402344
pass2: on o97 time is 0.402344
pass2: on o9 time is 0.402344

Since I have not really audited IOR to the point of understanding all of the timekeeping in it, I wonder: is this algorithm (correcting for the difference in startup times of the remote processes) incorrect?

The source for mpi_time.c:

#include <stdio.h>
#include <mpi.h>

/* MPI_CHECK() and GetTimeStamp() are taken from IOR.c */

int
main(int argc, char **argv)
{
    double initial_timestamp, timestamp;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int numTasksWorld = 0;
    int rank = 0;

    /* start the MPI code */
    MPI_CHECK(MPI_Init(&argc, &argv), "cannot initialize MPI");
    MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD, &numTasksWorld),
              "cannot get number of tasks");
    MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &rank), "cannot get rank");
    MPI_CHECK(MPI_Get_processor_name(processor_name, &namelen),
              "cannot get processor name");

    /* pass 1: raw timestamp on every task right after a barrier */
    MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD), "barrier error");
    initial_timestamp = GetTimeStamp();
    fprintf(stdout, "pass1: on %s time is %f\n", processor_name,
            initial_timestamp);

    /* pass 2: timestamp corrected by the initial one */
    MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD), "barrier error");
    timestamp = GetTimeStamp();
    fprintf(stdout, "pass2: on %s time is %f\n", processor_name,
            timestamp - initial_timestamp);

    MPI_CHECK(MPI_Finalize(), "cannot finalize MPI");

    return 0;
}

(which uses MPI_CHECK() and GetTimeStamp() from IOR.c)

> Let me know if you may have some comments.

The only thing I'd say is that your implementation of InitTimeStamp() seems almost redundant given GetTimeStamp(). Why not pass the correction as an argument to GetTimeStamp(), setting it to 0 to initialize and to init_timeval thereafter? Too messy? Perhaps.

    init_timeval = GetTimeStamp(0)

to initialize and

    GetTimeStamp(init_timeval)

thereafter.

I wonder, in fact, whether GetTimeStamp() couldn't do this initialization itself and always account for init_timeval. init_timeval could be stored as a static in GetTimeStamp(), along with an "initialized" static boolean, so that it is initialized the first time through. It's tempting to use a 0 in init_timeval as the "not initialized" flag, but I think it could legitimately be 0.

b.
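The static-variable idea at the end of this message could look roughly like the sketch below. It is not IOR code: GetTimeStampSinceFirstCall() and raw_time() are hypothetical names, the timer is the gettimeofday() branch from the patch rather than MPI_Wtime(), and the wall_clock_delta correction is left out to keep the example free of MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Raw wall-clock time in seconds, as in the gettimeofday() branch of the patch. */
static double raw_time(void)
{
    struct timeval timer;
    int rc = gettimeofday(&timer, NULL);

    assert(rc == 0);
    (void)rc; /* keep the build quiet when NDEBUG disables assert() */
    return (double)timer.tv_sec + (double)timer.tv_usec / 1000000.0;
}

/*
 * Hypothetical variant of GetTimeStamp(): the first call records the
 * baseline and returns 0.0; later calls return seconds elapsed since
 * that first call. A separate flag is used because, as noted above,
 * a baseline of 0 could be a legitimate value.
 */
double GetTimeStampSinceFirstCall(void)
{
    static int initialized = 0;
    static double init_timeval = 0.0;
    double now = raw_time();

    if (!initialized) {
        init_timeval = now;
        initialized = 1;
    }
    return now - init_timeval;
}
```

In an MPI program the first call would presumably be made right after a barrier, as mpi_time.c does, so that every task records its baseline at roughly the same moment. The trade-off compared with a separate InitTimeStamp() is that the baseline is set implicitly by whichever call happens first.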