Testing Padls Total

Message boards : News : Testing Padls Total
Tomáš Brada
Project administrator
Volunteer developer
Joined: 3 Feb 19
Posts: 276
Credit: 106,040
RAC: 162
Message 3316 - Posted: 18 Jun 2019, 21:01:09 UTC
Last modified: 30 Jun 2019, 11:53:01 UTC

The experiment named Padls Total has now entered its testing phase.
A new application was developed for this project and is currently available as a beta for Linux. This application finally supports checkpoints. Please decide whether you really want to participate in the beta work. The work generator is also being tested, so there are many tasks available, and if you have beta work enabled you may get far more than you want.

If you enable beta applications, be prepared to have:
* tasks suddenly aborted by server
* your completed tasks not validated for weeks
* no credit assigned for weeks
* tasks crash or get stuck running
* no apps for your favorite platform
* strict deadlines
* wrong run-time estimate

These are batches 15, 21 and 22 on the server_status page.

Another remark: it is not necessary to run the beta work continuously. Run it for a while and then disable it. Leave some for me to test. Credit allocation, the Windows application, automated result publication and deadline adjustments will all be done before the application leaves beta mode.

The server might not be available all the time, as I am trying a different storage solution.

Programming talk and development updates are in this thread.

mmonnin
Joined: 16 Feb 19
Posts: 19
Credit: 1,079,411
RAC: 6,074
Message 3321 - Posted: 19 Jun 2019, 20:59:22 UTC

Working fine so far. The ETA was a bit short of the actual run time, so some tasks didn't start before the short deadline.

Tomáš Brada
Project administrator
Volunteer developer
Message 3323 - Posted: 20 Jun 2019, 8:55:38 UTC

mean: 1.935724e+13
stdev: 8.035272e+12
samples: 4932


This is the distribution of workunit runtime in FLOP.
The workunits are sent with an estimated runtime of 14*10^12 FLOP (1.4e13), which is pretty close. I think I can move the estimate to 1.9e13, which is the mean value.
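
To put "pretty close" into numbers, here is a small sketch that compares the old and proposed estimates with the measured mean and standard deviation quoted above. The helper names are made up for illustration; only the four numbers come from the post.

```python
# Figures quoted in the post (runtime in FLOP).
MEAN_FLOP = 1.935724e13
STDEV_FLOP = 8.035272e12
OLD_ESTIMATE = 1.4e13   # estimate the workunits were sent with
NEW_ESTIMATE = 1.9e13   # proposed estimate, roughly the mean


def relative_error(estimate, mean):
    """Signed error of the estimate relative to the measured mean."""
    return (estimate - mean) / mean


def z_score(estimate, mean, stdev):
    """Distance of the estimate from the mean, in standard deviations."""
    return (estimate - mean) / stdev


for label, estimate in (("old", OLD_ESTIMATE), ("new", NEW_ESTIMATE)):
    err = relative_error(estimate, MEAN_FLOP)
    z = z_score(estimate, MEAN_FLOP, STDEV_FLOP)
    print(f"{label} estimate: {err:+.1%} vs. mean, {z:+.2f} stdev")
```

A negative relative error means the estimate is below the measured mean, so the client would expect tasks to finish sooner than they typically do.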

Tomáš Brada
Project administrator
Volunteer developer
Message 3324 - Posted: 20 Jun 2019, 9:04:49 UTC

The workunit length is variable. I can make them longer or shorter in computation time.
By default a task would run for a very long time, but every time it checkpoints, it checks whether it has already done enough work, and if so, it finishes. The same happens when you shut down BOINC.
You can set the checkpoint interval in the BOINC Manager. By default it is set to one minute, but if you set it to one hour, the workunit will run up to one hour longer than designed. This is OK for the project: the results are fine and credit is assigned in proportion to the work performed.
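
The "finish only at a checkpoint" behaviour can be illustrated with a toy simulation. This is a simplified sketch, not the actual application code: it assumes checkpoints happen exactly every `checkpoint_interval` seconds and that work accumulates at a constant rate. All names here are hypothetical.

```python
def run_task(target_work, work_per_second, checkpoint_interval):
    """Simulate a task that may only finish at a checkpoint.

    Returns the elapsed time in seconds. Assumes (for illustration only)
    that checkpoints occur exactly every `checkpoint_interval` seconds.
    """
    elapsed = 0
    done = 0
    while True:
        # Compute until the next checkpoint.
        elapsed += checkpoint_interval
        done += work_per_second * checkpoint_interval
        # At the checkpoint: state is saved, then the "enough work?" check runs.
        if done >= target_work:
            return elapsed


# A task designed for 7500 seconds of work at 1 unit of work per second.
designed = 7500
for interval in (60, 300, 3600):
    elapsed = run_task(designed, 1, interval)
    print(f"interval {interval:>4} s: ran {elapsed} s, overrun {elapsed - designed} s")
```

In this model the overrun is always smaller than the checkpoint interval, which matches the "up to one hour longer" remark for a one-hour interval.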

Tomáš Brada
Project administrator
Volunteer developer
Message 3338 - Posted: 24 Jun 2019, 10:07:20 UTC
Last modified: 24 Jun 2019, 10:07:48 UTC


  • Created application for Windows. Please report issues.
  • Please disable beta applications if you are unable to monitor your computers.
  • Increased the wu length estimate to 19e12.


Tomáš Brada
Project administrator
Volunteer developer
Message 3352 - Posted: 25 Jun 2019, 14:30:17 UTC

Question: What do you think about increasing the workunit length?
Double?

mmonnin
Message 3354 - Posted: 25 Jun 2019, 23:55:53 UTC

I'd be fine with doubling the length. Small units can stress server I/O.

No errors in the latest batch in Linux or Windows.

Tomáš Brada
Project administrator
Volunteer developer
Message 3382 - Posted: 30 Jun 2019, 7:11:16 UTC

Workunits are gone! But do not worry, there will be more. I want to adjust the work generator and then remove the "beta" mark from the application to allow non-beta-testers to run it.
Still, everyone is welcome to look at the source code, find bugs and suggest improvements.

Tomáš Brada
Project administrator
Volunteer developer
Message 3384 - Posted: 30 Jun 2019, 10:08:24 UTC

Testing two-hour-long workunits now.
I changed the assimilator so that as soon as a result is assimilated, a new workunit is generated.

Tomáš Brada
Project administrator
Volunteer developer
Message 3397 - Posted: 30 Jun 2019, 20:25:59 UTC

I noticed some very long run times, over 8 hours. It might just be a slow processor. Please check that you are getting adequate credit for such long tasks.

Tomáš Brada
Project administrator
Volunteer developer
Message 3462 - Posted: 15 Jul 2019, 8:45:43 UTC

Only 567 of the odlk found are duplicates within the framework of this experiment (out of 295894 found in total).

Natalia Makarova
Project scientist
Joined: 8 Feb 19
Posts: 162
Credit: 0
RAC: 0
Message 3463 - Posted: 15 Jul 2019, 9:04:29 UTC - in response to Message 3462.  
Last modified: 15 Jul 2019, 9:06:36 UTC

Only 567 odlk found are duplicate within the framework of this experiment (of 295894 total found).

This morning I found 209725 CF ODLS in the file https://boinc.tbrada.eu/download/tot_odlk_plain.txt (after decoding with the denamer program).

KAMCOBILL
Joined: 4 Mar 19
Posts: 9
Credit: 434
RAC: 12
Message 3465 - Posted: 15 Jul 2019, 15:11:01 UTC - in response to Message 3397.  
Last modified: 15 Jul 2019, 15:21:03 UTC

I noticed some very long run times. Over 8 hours. It just might be slow processor. Please check that you are getting adequate credit for such a long tasks.


Most of my WUs are running much longer, like 6, 7 or 8 days. I had to abort all but one. They sit for days at 0 time left.

I'm not sure if I'm unique. I'm not running slow processors, but I need to know how long I should run a task before aborting. The longest I ran was 8 days past the due date.

Help

Thank You

Bill

Tomáš Brada
Project administrator
Volunteer developer
Message 3466 - Posted: 15 Jul 2019, 19:58:47 UTC - in response to Message 3463.  

Indeed I messed up.
214932 are unique within the framework of this project
3 are duplicate
4 "others".

    SELECT cnt1, COUNT(odlk) cnt2
    FROM (
        SELECT odlk, COUNT(segment) cnt1
        FROM `tot_result_odlk`
        GROUP BY odlk
        ORDER BY cnt1 DESC
    ) q2
    GROUP BY cnt1
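
For readers who want to see what this query computes, here is a self-contained sketch that runs a lightly adapted version of it against a toy in-memory SQLite table. The real table lives on the project server (presumably MySQL, given the backticks); the column types and the sample rows below are assumptions, and the `ORDER BY` was moved to the outer query so the output order is defined.

```python
import sqlite3

# Adapted from the query in the post: the inner query counts how many
# segments reported each odlk, the outer query counts how many odlk share
# each of those counts (a histogram of "times found").
QUERY = """
SELECT cnt1, COUNT(odlk) AS cnt2
FROM (
    SELECT odlk, COUNT(segment) AS cnt1
    FROM tot_result_odlk
    GROUP BY odlk
) q2
GROUP BY cnt1
ORDER BY cnt1
"""


def duplicate_histogram(rows):
    """Return {times_found: number_of_distinct_odlk} for (odlk, segment) rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only the table and column names come from the post.
    conn.execute("CREATE TABLE tot_result_odlk (odlk TEXT, segment INTEGER)")
    conn.executemany("INSERT INTO tot_result_odlk VALUES (?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


# Toy data: A and C found once, B found twice, D found three times.
toy_rows = [("A", 1), ("B", 1), ("B", 2), ("C", 3), ("D", 4), ("D", 5), ("D", 6)]
print(duplicate_histogram(toy_rows))
```

In the result, the key 1 counts the odlk found exactly once (unique), the key 2 those found twice, and so on.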

Tomáš Brada
Project administrator
Volunteer developer
Message 3467 - Posted: 15 Jul 2019, 20:02:33 UTC - in response to Message 3465.  

Not sure if I'm unique, not running slow processors, but need to know how long should I run before aborting? The longest I ran was 8 days past Due Date.

The workunits are calibrated at 2 hours on a Ryzen 1700 processor.
A task can only finish when it checkpoints, so make sure your checkpoint interval is set to less than a day.
I will look at the logs of your tasks.

Tomáš Brada
Project administrator
Volunteer developer
Message 3468 - Posted: 15 Jul 2019, 20:25:23 UTC
Last modified: 15 Jul 2019, 20:25:50 UTC

For example: https://boinc.tbrada.eu/result.php?resultid=293963
This task ran for a whopping seven and a half days. It is imperative to investigate why. The task did checkpoint regularly. The checkpoint was not uploaded to the server, because BOINC considers the task a failure and deletes the partial results, even though it worked.
The task was later replicated and finished in 2 h, as it should. The subsequent task in the same segment also finished fine.

Please look for tasks that take exceptionally long. Suspend and resume such a task, or restart BOINC, to trigger a checkpoint; then, if possible, upload the checkpoint file from the BOINC slot directory (before aborting).

PDW
Joined: 24 Feb 19
Posts: 6
Credit: 1,000,329
RAC: 115
Message 3470 - Posted: 15 Jul 2019, 22:04:27 UTC - in response to Message 3468.  
Last modified: 15 Jul 2019, 22:06:05 UTC

Not sure how much file I/O there is when checkpointing, especially on old slow hard drives, but this task (which got the credit) ran for over 3 days:
https://boinc.tbrada.eu/result.php?resultid=295454

It looks like it was checkpointing like there was no tomorrow. Did it spend most of its time checkpointing rather than doing the calculation?

Try changing your computing preferences to only checkpoint every 300 seconds and see if that improves things.

KAMCOBILL
Message 3472 - Posted: 16 Jul 2019, 2:06:47 UTC - in response to Message 3470.  

Here is info on a WU running 1+ day:


Application: PADLS Total 5.07
Name: tot5_51c_Su6W2nL1F6QmrDRkL1exWpHiD
State: Running
Received: 7/13/2019 22:22:09
Report deadline: 7/20/2019 22:22:11
Estimated computation size: 40,000 GFLOPs
CPU time: 01:16:26
CPU time since checkpoint: ---
Elapsed time: 1d 09:31:36
Estimated time remaining: 00:05:49
Fraction done: 99.711%
Virtual memory size: 5.33 MB
Working set size: 8.95 MB
Directory: slots/5
Process ID: 9472
Progress rate: 2.880% per hour
Executable: tot5_507_windows_x86_64.exe
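
One thing that stands out in these properties is the gap between CPU time and elapsed time. The sketch below computes the ratio from the two values shown above; the parser is a hypothetical helper that only handles the `HH:MM:SS` and `Nd HH:MM:SS` forms seen here, and the ratio by itself does not say why the task got so little CPU time.

```python
import re


def parse_duration(text):
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' into a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days = int(match.group(1) or 0)
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of wall-clock time during which the task was on the CPU."""
    return parse_duration(cpu_time) / parse_duration(elapsed_time)


# Values copied from the task properties above.
ratio = cpu_efficiency("01:16:26", "1d 09:31:36")
print(f"CPU time / elapsed time = {ratio:.1%}")
```

A ratio far below 100% means the task spent most of its elapsed time not computing, for example because it was waiting, suspended or blocked.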

KAMCOBILL
Message 3473 - Posted: 16 Jul 2019, 2:23:08 UTC - in response to Message 3470.  

Not sure how much file io there is when checkpointing, especially if on old slow hard drives, but this task (that got the credit) ran for over 3 days,
https://boinc.tbrada.eu/result.php?resultid=295454

It looks like it was checkpointing like there was no tomorrow, did it spend most of its time checkpointing rather than doing the calculation ?

Try changing your computing preferences to only checkpoint every 300 seconds and see if that improves things.


Changed to 300 seconds.

Natalia Makarova
Project scientist
Message 3474 - Posted: 16 Jul 2019, 15:11:21 UTC - in response to Message 3466.  

Indeed I messed up.
214932 are unique within the framework of this project
3 are duplicate
4 "others".

What are the 4 "others"?

©2019 Tomáš Brada