Closed Bug 627635 (localStorageIO) Opened 13 years ago Closed 5 years ago

Make remote localStorage not block the UI main thread on disk IO

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P2)

Tracking

RESOLVED WORKSFORME
Tracking Status
fennec - ---

People

(Reporter: cjones, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: main-thread-io, perf, Whiteboard: [at-risk][tech-bug])

Attachments

(2 files, 1 obsolete file)

It came out in bug 622112 that our localStorage implementation blocks the content-process main thread on the chrome-process main thread, which is not great but OK.  But, the chrome-process main thread then appears to be blocked on the localStorage request, which involves grabbing the SQLite mutex, which means blocking on disk.  That's not palatable for fennec.
Allowing web content to block fennec's UI thread on disk IO is ... not palatable.  I think we either need to not do this or disable localStorage for 4.0.  We've never shipped a fennec with it enabled, and it's being phased out by IndexedDB, so I wouldn't lose too much sleep over punting it to a 4.x or later.

I don't know what's involved in avoiding UI-thread disk IO.
tracking-fennec: --- → ?
Summary: Make remote localStorage not block the UI main thread on disk → Make remote localStorage not block the UI main thread on disk IO
Honza, the following stack from bug 622112 is just the flushing timer running.  It looks like a prime candidate to move to a background thread, but I don't know anything about whether storage connections can be used off the main thread.

#0  mozilla::storage::SQLiteMutex::lock (this=0x7fffe08c6928) at
/home/cjones/mozilla/mozilla-central/storage/src/SQLiteMutex.h:128
#1  0x00007ffff4a40188 in SQLiteMutexAutoLock (this=0x7fffffffd290, aMutex=...)
at /home/cjones/mozilla/mozilla-central/storage/src/SQLiteMutex.h:183
#2  0x00007ffff4a3eaec in mozilla::storage::Connection::BeginTransactionAs
(this=0x7fffe08c68f0, aTransactionType=0) at
/home/cjones/mozilla/mozilla-central/storage/src/mozStorageConnection.cpp:930
#3  0x00007ffff37db6ee in mozStorageTransaction (this=0x7fffffffd300,
aConnection=0x7fffe08c68f0, aCommitOnComplete=0, aType=0) at
../../dist/include/mozStorageHelper.h:77
#4  0x00007ffff417fa46 in nsDOMStoragePersistentDB::FlushTemporaryTables
(this=0x7fffe08c4810, force=false) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStoragePersistentDB.cpp:535
#5  0x00007ffff4179a6b in nsDOMStorageDBWrapper::FlushAndDeleteTemporaryTables
(this=0x7fffe08c4810, force=false) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStorageDBWrapper.cpp:104
#6  0x00007ffff416dcc9 in nsDOMStorageManager::Observe (this=0x7fffe02cac00,
aSubject=0x0, aTopic=0x7ffff58f2205 "domstorage-flush-timer", aData=0x0) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStorage.cpp:486
#7  0x00007ffff4f0f771 in nsObserverList::NotifyObservers (this=0x9db7c8,
aSubject=0x0, aTopic=0x7ffff58f2205 "domstorage-flush-timer", someData=0x0) at
/home/cjones/mozilla/mozilla-central/xpcom/ds/nsObserverList.cpp:130
#8  0x00007ffff4f1140e in nsObserverService::NotifyObservers
(this=0x7fffe00f5160, aSubject=0x0, aTopic=0x7ffff58f2205
"domstorage-flush-timer", someData=0x0) at
/home/cjones/mozilla/mozilla-central/xpcom/ds/nsObserverService.cpp:182
#9  0x00007ffff416d837 in nsDOMStorageManager::Observe (this=0x7fffe02cac00,
aSubject=0x7fffe0c1de70, aTopic=0x7ffff5b5393d "timer-callback", aData=0x0) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStorage.cpp:449
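A minimal sketch of what moving that flush to a background thread could look like (the runnable and the wiring here are hypothetical, and it assumes the storage connection may be touched off the main thread, which is exactly the open question):

#include "nsIRunnable.h"
#include "nsIThread.h"
#include "nsThreadUtils.h"

// Hypothetical runnable: runs the periodic flush on a background thread so
// the SQLite mutex is only ever taken off the UI thread.
class FlushRunnable : public nsIRunnable
{
public:
  NS_DECL_ISUPPORTS

  FlushRunnable(nsDOMStorageDBWrapper* aDB) : mDB(aDB) {}

  NS_IMETHOD Run()
  {
    // Blocks on disk here, on the background thread, not on the UI thread.
    mDB->FlushAndDeleteTemporaryTables(PR_FALSE);
    return NS_OK;
  }

private:
  nsDOMStorageDBWrapper* mDB;  // assumed to outlive the runnable
};
NS_IMPL_THREADSAFE_ISUPPORTS1(FlushRunnable, nsIRunnable)

// In the "domstorage-flush-timer" observer, instead of flushing in place:
nsCOMPtr<nsIThread> flushThread;
nsresult rv = NS_NewThread(getter_AddRefs(flushThread));
NS_ENSURE_SUCCESS(rv, rv);
rv = flushThread->Dispatch(new FlushRunnable(mStorageDB), NS_DISPATCH_NORMAL);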
Josh, I commented on bug 622112 comment 30, but those suggestions are just a partial fix.

AFAIK, sqlite is not very thread safe, or at least wasn't the last time I witnessed attempts to use it on multiple threads.

I would tend more toward using the async API for this; that would accomplish the premise of not having IO on the main thread.

But first I'd like to know why it takes so long to acquire the mutex, when there is just a single connection bound to a single database file on disk, all of it accessed only on the main thread.
SQLite is threadsafe (in that it serializes access to the database).  The mozIStorageConnection object is also threadsafe, but I'd prefer you to use an asynchronous statement to do background thread work (this lets Storage manage the thread for you).

I don't have any comments as to why it takes so long to get the mutex.
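A minimal sketch of that asynchronous-statement approach (the query, the callback, and |mConnection| are illustrative; ExecuteAsync runs the statement on Storage's background thread and calls the callback back on the calling thread):

#include "mozIStorageAsyncStatement.h"
#include "mozIStoragePendingStatement.h"
#include "mozIStorageStatementCallback.h"

class FlushCallback : public mozIStorageStatementCallback
{
public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD HandleResult(mozIStorageResultSet* aResults) { return NS_OK; }
  NS_IMETHOD HandleError(mozIStorageError* aError) { return NS_OK; }
  NS_IMETHOD HandleCompletion(PRUint16 aReason)
  {
    // aReason says whether the statement finished, errored, or was canceled.
    return NS_OK;
  }
};
NS_IMPL_ISUPPORTS1(FlushCallback, mozIStorageStatementCallback)

// The SQL is a stand-in for whatever the flush actually does; the point is
// ExecuteAsync() instead of a synchronous Execute() on the main thread.
nsCOMPtr<mozIStorageAsyncStatement> stmt;
nsresult rv = mConnection->CreateAsyncStatement(
    NS_LITERAL_CSTRING("DELETE FROM webappsstore2_temp"),
    getter_AddRefs(stmt));
NS_ENSURE_SUCCESS(rv, rv);

nsCOMPtr<mozIStoragePendingStatement> pending;
rv = stmt->ExecuteAsync(new FlushCallback(), getter_AddRefs(pending));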
aza, can you take a look?
Assignee: nobody → azakai
tracking-fennec: ? → 2.0+
^azakai, of course.
Alon, also check bug 622112 comment 30, though those suggestions are only a partial solution.
Some questions:

1. How common is localStorage on the web?
2. How big of a delay is caused? (hard for me to tell from bug 622112 whether localStorage is the main issue or not)
3. Can we estimate how much of a speedup the fixes in bug 622112 comment 30 would get us?
(In reply to comment #8)
> 3. Can we estimate how much of a speedup the fixes in bug 622112 comment 30
> would get us?

Avoiding the call to GetTransactionInProgress in MaybeCommitInsertTransaction will save acquiring the lock for every GetItem not preceded by a SetItem.

The second optimization should actually be: don't open a transaction until a scope whose data has been modified is found.  That will save acquiring the lock when a page only reads localStorage.


In general, I don't think we can do a lot here... the localStorage API is intended to be fully synchronous, so to satisfy all requests for Get and Set we have to block.

I just don't know sqlite well enough to say why there is one global lock for all IO ops that, when accessing one file, prevents accessing other completely separate files.  It just doesn't make sense to me.
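A minimal sketch of that second optimization (the member flag and helper name are hypothetical): only the write path ever starts a transaction, so a page that merely reads localStorage never touches the SQLite mutex for transaction bookkeeping.

nsresult
nsDOMStoragePersistentDB::EnsureInsertTransaction()
{
  // Track the transaction ourselves instead of asking SQLite via
  // GetTransactionInProgress (which takes the SQLite mutex).
  if (mHasPendingTransaction) {
    return NS_OK;
  }
  nsresult rv = mConnection->BeginTransactionAs(
      mozIStorageConnection::TRANSACTION_DEFERRED);
  NS_ENSURE_SUCCESS(rv, rv);
  mHasPendingTransaction = PR_TRUE;
  return NS_OK;
}

// Called only from the SetItem/RemoveItem paths; GetItem never calls it,
// so pure reads stay lock-free on our side.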
can we do the sets+gets to an in-memory hash, then flush to sql when the tab closes?
(In reply to comment #10)
> can we do the sets+gets to an in-memory hash, then flush to sql when the tab
> closes?
I think that's an awfully big change this late in the 2.0 release cycle.  I sure wouldn't like to see that go in now.
cjones, how bad is this?  Is it worth just pulling the plug on localStorage like we did with IndexedDB?
What I don't know is how many sites rely on localStorage.  We know some do, like tbpl.  google.com uses it, but I doubt it's broken without it.  Comment 8 asks this too.

What we're trading for breaking those sites is potentially huge UI latency, O(second) judging by sync-disk-IO problems we've had elsewhere.  This will happen if the disk is busy and/or we're on a file system like the Galaxy S***'s.  The possibility of this happening when loading a site as common and innocuous as google.com concerns me.  This UI latency is also under the control of web content.  The more web content uses localStorage, the more lag we get.  That makes fennec look bad.

I don't know what we gain in terms of site compat by leaving localstorage enabled.  That's the key bit of data we'd need to make a good decision.  Short of that, ... I guess we'd have to browse a bunch of sites known to use localstorage and see how bad things are in the common case, and if the experience is acceptable, hope we don't hit bad cases too much in the wild.
(In reply to comment #10)
> can we do the sets+gets to an in-memory hash, then flush to sql when the tab
> closes?

Please don't.  There is already a hash table used to store values, mirroring the data, but the final goal is to get rid of it and put all processing of the data fully under sqlite's control.  We need more than just a simple hash table to have all capabilities of localStorage (DOM storage in general) working.

The fix should figure out why we block and try to avoid the block as much as possible.

I can do some research with optimizations I suggested earlier.
I tried to find websites that use localStorage. The following do:

google.com does (but not mobile)
espn (but not mobile)
cnn (but not mobile)
orbitz
lonelyplanet
sfchronicle

but it looks like the majority of popular websites don't use it. And even when they do, their mobile versions don't.

As for how important localStorage actually is to those websites, it's hard to tell. I don't notice any immediate difference when using the websites mentioned above without localStorage - they don't obviously break - but possibly some minor functionality (that is noticed later on) is lost.

Given that, my recommendation is to disable localStorage for Fennec for 4.0. Definitely not worth a perf hit, and probably not worth the risk of any serious work (unless there is something simple that can completely fix this).
Any blog that uses disqus for its commenting engine uses localstorage, but I also don't know how it's used therein.
Attached patch experimental1 (obsolete) — Splinter Review
This is an experimental patch that should lower the chance of hitting the lock.  If anyone can test with it let me know.  I can also provide a try build with this patch if anyone wants.
Attached patch experimental1.1Splinter Review
Slightly simplified and with a few fixes relative to the previous one.
Attachment #507715 - Attachment is obsolete: true
I'm not sure how to test this. I can create a page with localStorage and a page without, but I don't notice any difference in performance.

I tried the websites mentioned in the parent bug as well, but 2 out of the 3 websites mentioned (bugzilla and facebook) don't even use localStorage.

Are we sure that localStorage is actually a significant part of the ~1 second hang mentioned in the parent bug?
(In reply to comment #19)
> I'm not sure how to test this. I can create a page with localStorage and a page
> without, but I don't notice any difference in performance.

On device?  What access pattern does your page with localStorage use?

> Are we sure that localStorage is actually a significant part of the ~1 second
> hang mentioned in the parent bug?

Not the least bit, sorry if I implied otherwise.  It's synchronous, UI-thread disk IO though, and that's guaranteed to bite us.  The question is will it bite us infrequently enough to leave it enabled for 4.0.
(In reply to comment #20)
> (In reply to comment #19)
> > I'm not sure how to test this. I can create a page with localStorage and a page
> > without, but I don't notice any difference in performance.
> 
> On device?  What access pattern does your page with localStorage use?

Galaxy S and Linux desktop.

My test page does a lot of writes and reads to localStorage.

> 
> > Are we sure that localStorage is actually a significant part of the ~1 second
> > hang mentioned in the parent bug?
> 
> Not the least bit, sorry if I implied otherwise.  It's synchronous, UI-thread
> disk IO though, and that's guaranteed to bite us.  The question is will it bite
> us infrequently enough to leave it enabled for 4.0.

Yes, good question.
IPC profiling data from an android device can be seen in bug 629513 comment 6 / bug 629513 comment 8. Web storage does indeed lead to a significant slowdown on google.com (nonmobile).
Alon, I'm curious why SetValue and RemoveValue have such short latency compared to GetValue.  Could it be that google's first call to GetValue leads to an initial DB load of the whole domain into a temp table, and only this first call is therefore that slow?  Or is it just called much more often than SetValue and RemoveValue?

If the former, we could pre-cache the DB in the chrome process somehow, or search for an optimization in a more narrow space.
(In reply to comment #23)
> Alon, I'm curious why SetValue and RemoveValue have such short latency
> compared to GetValue.  Could it be that google's first call to GetValue
> leads to an initial DB load of the whole domain into a temp table, and only
> this first call is therefore that slow?  Or is it just called much more
> often than SetValue and RemoveValue?
> 

The last column shows the number of times each is called. So GetValue was called 11 times, SetValue 2 times, and RemoveValue 3 times. However, even accounting for that, GetValue takes much more time, so your first explanation seems most likely.

However, it is also possible there was some noise in that run, so we should get more data. I am working on a better profiling tool now.
Attached file profiling data
Better profiling data, of google.com load (nonmobile).

Again GetValue is quite high, ~0.5 seconds.
And what is the distribution of duration among each call?
Unfortunately I don't have a test env for this.  The Android port for my Freerunner needs some fixes to even be able to connect to wifi.  I don't know when I'll get it working, so please give me as much information as you can to move this forward.
Honza, we're in the end game for Fennec. What's the plan for this? Also, what do you need to move forward?
Whiteboard: [at-risk]
I don't think it's wise to take this in 2.0, FWIW.  This is going to be a big enough change that I'd think it would need beta coverage, but we just tagged our last beta.
based on Shawn's comment, moving this to 2.0next
tracking-fennec: 2.0+ → 2.0next+
(In reply to comment #30)
> based on Shawn's comment, moving this to 2.0next
Although, mobile may want to disable localStorage like in the past.
(In reply to comment #26)
> And what is the distribution of duration among each call?

I am seeing less of a problem now, not sure why. But here is some data from loading google.com (nonmobile):

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.18600 seconds, mainThread == 1
I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00100 seconds, mainThread == 1

So just one of those ended up significantly slow.
FWIW, loading my gmail account on my galaxy s*** is very painful, many several-second UI pauses.  Didn't profile, but localStorage would be my first guess for the UI blockage.  The galaxy s*** is kind of a worst-case though.

It would be interesting for other folks to try gmail on their less s***ty phones.
tracking-fennec: 2.0next+ → 2.0+
after discussing with blassey and stuart, this is something that is serious for many of the popular devices we are targeting.
oh hell.  2.0next.  wrong bug.
tracking-fennec: 2.0+ → 2.0next+
(In reply to comment #33)
> FWIW, loading my gmail account on my galaxy s*** is very painful, many
> several-second UI pauses.  Didn't profile, but localStorage would be my first
> guess for the UI blockage. 

cjones, I am trying gmail on galaxy s to try to figure out this issue. I don't think I see the same degree of pain as you do so I may be doing something different. I tried both mobile and standard HTML views, and both accounts with no emails in the inbox and with 20-30 emails.

The only interesting thing I see in the profiles I've generated is that we spend a lot of time in nsRefreshDriver, especially when there are lots of emails in the account (1.3130 seconds over 4 page loads). But even that doesn't seem so bad, no several-second UI pauses in particular. Are you still seeing those?
(I am hoping fixing bug 634666 got rid of the UI pauses.)
Should note that I saw the pain on 2.1.  Haven't tried since I upgraded to 2.2; other stuff seems less s***ty with 2.2 so I wouldn't be surprised if gmail is too.
tracking-fennec: 2.0next+ → 6+
Are we certain this is still an issue? Or should we collect new data?
Assignee: azakai → nobody
Ran the profiling code on this and saw:

IPC|RecvGetValue|main 0.0280 total seconds, 0.0031 mean seconds, 9 counted

when loading google.com (desktop site) in Fennec on a Droid X.
(In reply to comment #40)
> Ran the profiling code on this and saw:
> 
> IPC|RecvGetValue|main 0.0280 total seconds, 0.0031 mean seconds, 9 counted
> 
> when loading google.com (desktop site) in Fennec on a Droid X.

Seems to not be a problem, if I'm reading that right.
tracking-fennec: 6+ → -
Blocks: 698500
I'm seeing animation jank on desktop Firefox that correlates with writes to webappsstore.sqlite.

If confirmed, this is a P1 for Firefox. Honza, do you have time to pick this back up again?
OS: Linux → All
Hardware: x86_64 → All
(In reply to Dietrich Ayala (:dietrich) from comment #42)
> Honza, do you have time to pick this
> back up again?

First I'd like to investigate if we could move all this stuff to LevelDB and see how that would work (bug 687566).

If that takes too long to finish, I will try to get to this soon, but not before the Aurora merge.
Blocks: 682602
Alias: localStorageIO
Depends on: 687566
Whiteboard: [at-risk] → [at-risk][see comment #43 for plan]
Thanks Honza.

How would moving to LevelDB specifically reduce the main thread IO caused by localStorage? Are all writes done off-thread? Or does it mitigate the problem by allowing us to write less and less often?
Blocks: StorageJank
(In reply to Dietrich Ayala (:dietrich) from comment #44)
> Thanks Honza.
> 
> How would moving to LevelDB specifically reduce the main thread IO caused by
> localStorage? Are all writes done off-thread? Or does it mitigate the
> problem by allowing us to write less and less often?

I'm adding Matthew, who's working on LevelDB integration, to answer at least the off-main-thread question.  I personally don't remember whether LevelDB uses background threads, but can't say for sure myself.

According to the tests, it seems that LevelDB should simply be much faster.  I want to investigate first, but unfortunately we still don't have even a prototype LevelDB implementation in the mozilla tree to start playing with it.

If that takes even longer, I will have to look for conventional solutions that would go against my future plans, which were to rework the code to be both much simpler and faster.
(In reply to Dietrich Ayala (:dietrich) from comment #44)
> How would moving to LevelDB specifically reduce the main thread IO caused by
> localStorage? Are all writes done off-thread? Or does it mitigate the
> problem by allowing us to write less and less often?
Writing less doesn't really mitigate it; it just makes it harder to notice until you have a really bad case.

(In reply to Honza Bambas (:mayhemer) from comment #45)
> According to the tests, it seems that LevelDB should simply be much faster.
> I want to investigate first, but unfortunately we still don't have even a
> prototype LevelDB implementation in the mozilla tree to start playing with
> it.
LevelDB should be faster because it's designed to be a simple key-value store, which is exactly what `localStorage` is.  SQLite is a bit more complex than we need.
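To illustrate how direct the mapping is, a self-contained LevelDB sketch (the path and keys are made up; this is essentially the entire API surface a key-value store like localStorage needs):

#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status status =
      leveldb::DB::Open(options, "/tmp/webappsstore-example", &db);
  assert(status.ok());

  // WriteOptions().sync defaults to false: no fsync per write, the same
  // durability trade-off discussed below for PRAGMA synchronous.
  status = db->Put(leveldb::WriteOptions(), "example.com|color", "blue");
  assert(status.ok());

  std::string value;
  status = db->Get(leveldb::ReadOptions(), "example.com|color", &value);
  assert(status.ok() && value == "blue");

  delete db;
  return 0;
}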
(In reply to Honza Bambas (:mayhemer) from comment #45)
> (In reply to Dietrich Ayala (:dietrich) from comment #44)
> > How would moving to LevelDB specifically reduce the main thread IO caused by
> > localStorage? Are all writes done off-thread? Or does it mitigate the
> > problem by allowing us to write less and less often?
> 
> I'm adding Matthew, who's working on LevelDB integration, to answer at least
> the off-main-thread question.  I personally don't remember whether LevelDB
> uses background threads, but can't say for sure myself.

LevelDB does use background threads to improve performance, but writes can still block when buffers fill up and the background thread needs to catch up.  So bad cases can still occur.
Blocks: 701880
As stopgaps, is it worthwhile to do any or all of the following:

1) Use a more relaxed PRAGMA synchronous on the localStorage DB;
2) Use SQLite's asynchronous (writes on background threads) VFS or one that we write ourselves;
3) Some other way of moving IO off the main thread that I'm not aware of because I'm new to all of this...

?  At least option (2) sounds quite similar to what LevelDB already does.
(In reply to Nathan Froyd (:froydnj) from comment #49)
> 1) Use a more relaxed PRAGMA synchronous on the localStorage DB;

This is likely doable if we consider data loss an unimportant issue for localStorage. Since the data is set by the page, I expect any page already has some sort of fallback for when the data is not yet in localStorage. On corruption it could just throw away the db. Cookies similarly use synchronous=OFF.
That way writes would be much faster, since it'd never fsync.

> 2) Use SQLite's asynchronous (writes on background threads) VFS or one that
> we write ourselves;

This is probably too expensive if levelDB is around the corner.

> 3) Some other way of moving IO off the main thread that I'm not aware of
> because I'm new to all of this...

One may use partitioned tables with a part in memory, and handle all of that with magical queries. Well, we did this in Places for Firefox 3.5 and it is a nightmare.

(In reply to Dietrich Ayala (:dietrich) from comment #44)
> How would moving to LevelDB specifically reduce the main thread IO caused by
> localStorage?

Take these results with care, since obviously the benchmarks are made to be largely in favor of levelDB, but it is really good at what it does (simple hashing with sequential reads/writes)
http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
It may also use openUnsharedDatabase and exclusive locking, to avoid the shared cache and the concurrent-writer checks.
These changes are simple and known to improve performance.
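Putting comment 49's option (1) together with the tweaks above, a sketch of the cheap wins (not an actual patch; |dbFile| is assumed to point at webappsstore.sqlite, and error handling is trimmed):

#include "mozIStorageConnection.h"
#include "mozIStorageService.h"
#include "nsServiceManagerUtils.h"

nsCOMPtr<mozIStorageService> service =
    do_GetService("@mozilla.org/storage/service;1");
nsCOMPtr<mozIStorageConnection> connection;
nsresult rv = service->OpenUnsharedDatabase(dbFile,
                                            getter_AddRefs(connection));
NS_ENSURE_SUCCESS(rv, rv);

// Nobody else opens this file, so skip the concurrent-writer checks.
rv = connection->ExecuteSimpleSQL(
    NS_LITERAL_CSTRING("PRAGMA locking_mode = EXCLUSIVE"));
NS_ENSURE_SUCCESS(rv, rv);

// Trade durability for speed: never fsync (cookies do the same).
rv = connection->ExecuteSimpleSQL(
    NS_LITERAL_CSTRING("PRAGMA synchronous = OFF"));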
(In reply to Marco Bonardo [:mak] from comment #50)
> (In reply to Nathan Froyd (:froydnj) from comment #49)
> > 1) Use a more relaxed PRAGMA synchronous on the localStorage DB;
> 
> This is likely doable if we consider dataloss as a nonimportant issue for
> localStorage.

As a side note, levelDB in synchronous mode may perform worse than SQLite, so it's probably something that we'll have to do regardless, when moving to it.
IMHO, dataloss is totally ok for localStorage.
Depends on: 704933
I've filed Bug 704933 to investigate these easy improvements, I hope someone from the perf team (or anyone else) can take it.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #53)
> IMHO, dataloss is totally ok for localStorage.

Why? Unlike sessionStorage, which is understood to be transient and volatile, localStorage is designed to be persistent.
(In reply to Dietrich Ayala (:dietrich) from comment #55)
> Why? Unlike sessionStorage, which is understood to be transient and
> volatile, localStorage is designed to be persistent.

Dataloss here means database corruption. All the data in localStorage comes from the web and can be added back again by the web app.
(In reply to Marco Bonardo [:mak] from comment #56)
> Dataloss here means database corruption. All the data in
> localStorage comes from the web and can be added back again by the web app.
That's not true at all.  Think offline apps.  Think of pages that store user preferences in localStorage because cookies can easily be deleted.
(In reply to Shawn Wilsher :sdwilsh from comment #57)
> That's not true at all.  Think offline apps.  Think of pages that store user
> preferences in localStorage because cookies can easily be deleted.

My point is that those pages have a fallback for when we don't have any of their data; otherwise they'd just not work. In case of dataloss nothing is really lost, apart from the fact that the page will have to populate the storage again, which, sure, may be an annoyance in the case of offline apps. But even now we are surely not doing everything we can to protect against that; we use sync=normal.
If we really don't want to move to the complete removal of fsyncs, we may still move to WAL+NORMAL, with a small journal size.
At this point, if we disallow levelDB from working on the main thread, I don't see it as a viable solution for localStorage, so we need an alternative path to reduce fsyncs.
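A sketch of that WAL+NORMAL middle ground (the journal size is an illustrative value; |connection| is the localStorage connection as in the earlier sketch):

rv = connection->ExecuteSimpleSQL(
    NS_LITERAL_CSTRING("PRAGMA journal_mode = WAL"));
NS_ENSURE_SUCCESS(rv, rv);

// NORMAL in WAL mode only fsyncs on checkpoint, not on every transaction.
rv = connection->ExecuteSimpleSQL(
    NS_LITERAL_CSTRING("PRAGMA synchronous = NORMAL"));
NS_ENSURE_SUCCESS(rv, rv);

// A small WAL keeps checkpoints, and hence any stalls, short.
rv = connection->ExecuteSimpleSQL(
    NS_LITERAL_CSTRING("PRAGMA journal_size_limit = 524288"));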
Blocks: 701872
Depends on: 736144
Whiteboard: [at-risk][see comment #43 for plan] → [at-risk][see comment #43 for plan][Snappy]
e10s is not in Firefox desktop, so not covered by snappy.
Whiteboard: [at-risk][see comment #43 for plan][Snappy] → [at-risk][see comment #43 for plan]
dup of either 807021 or 600307
Honza: Bug 807021 doesn't fully cover this since bug 807021 is only about writes. There is no reason we should ever need to do reads on the main thread if someone in a child process accesses localStorage since we can easily just block the child process until we've asynchronously read the data.

Likewise, I would imagine bug 600307 doesn't fully solve the read problem here, does it?

I do think that waiting for bug 600307 before fixing this is the right thing to do though.
(In reply to Jonas Sicking (:sicking) from comment #64)
> Honza: Bug 807021 doesn't fully cover this since bug 807021 is only about
> writes. There is no reason we should ever need to do reads on the main
> thread if someone in a child process accesses localStorage since we can
> easily just block the child process until we've asynchronously read the data.
> 
> Likewise, I would imagine bug 600307 doesn't fully solve the read problem
> here, does it?

Sure it does.  In the parent process we do the load in the background.  Actually, it behaves the same way as if there were just a single process.  Preloads happen on the background thread; if a sync load happens, though, it blocks (there is no way around it, even for IPC!).  

According to existing telemetry, domain data reads are fast.


The first telemetry from desktop Firefox using the patch from bug 600307 shows 75-100% of preloads make it before localStorage access.  SSD makes it better, HDD makes it a bit worse.  Load times are about 20 ms when blocking.

> 
> I do think that waiting for bug 600307 before fixing this is the right thing
> to do though.

This is a dupe of bug 600307.
> Preloads happen on the background thread; if a sync load happens, though, it
> blocks (there is no way around it, even for IPC!).  

That's the thing. For IPC there is a way around it!

On the child side you make a blocking call to the parent. But when there's a call coming from a child process, there is no need to do anything blocking in the parent! You can asynchronously load the data and while the load is happening the child process remains blocked. Once the data has been loaded, you can send back a response to the child which unblocks it.

So from the child process point of view, it is a blocking call. But in the parent process nothing blocking happens.

Of course, if localStorage is *used* in the parent process we'd have to make a blocking call. That we can't do anything about. But we shouldn't need to block when localStorage is used in a child process.

That's what this bug is about. And that part is not fixed in any of the dependent bugs.
(In reply to Jonas Sicking (:sicking) from comment #66)
> > Preloads happen on the background thread; if a sync load happens, though,
> > it blocks (there is no way around it, even for IPC!).  
> 
> That's the thing. For IPC there is a way around it!

Do you have a code example?  Because I don't think what you suggest is going to work or I simply am not getting it.  How can I not block the parent?  That is not clear to me.

Thanks.
Content processes should communicate directly with a non-main thread in the parent process.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #68)
> Content processes should communicate directly with a non-main thread in the
> parent process.

How?
Child process main thread creates an IPC channel for PStorage or something on its main thread, and then the parent process opens and binds that channel to a non-main thread that services the requests.
StorageParent::RecvSomeSyncRequest(const &in_arg, &out_arg) will be called on the main thread and will block the main thread until the value of |out_arg| is known.

I don't know of a way to make sync requests be handled on a non-main thread.
We would need to move the StorageParent code to run off the main thread, with a dedicated IPC channel.  Nothing would be different from the client (child) side.
Chris: Is there no way in ipdl to send a message from the child to the parent and then block until it gets a response from the parent, while letting the parent produce a response asynchronously? I.e. let the parent send a separate message in order to deliver the response or some such?

If that's not possible, that seems like something we should add to ipdl. Allowing a child to block while the parent runs some slow operation seems like one of the big benefits we could get from process separation.
The parent can spin the event loop while creating the response.
That is kind of an ok case for event loop spinning.
Right, spinning the event loop is the only way to do that.  The semantics of half-sync are messy and not at all clear, so I don't want to add it.  Spinning the event loop achieves the same effect.
Ah, is that what the message manager does? IIRC the message manager has the ability to let the parent process create a result asynchronously.

So yeah, one "simpler" option here would be to let the parent process, when it gets a localStorage request from the child, call the preload API added in bug 600307, and then spin the event loop until whatever data was requested is available. Then it could fetch that data without ever causing blocking IO in the parent.
Message manager doesn't have such functionality, but scripts in the parent can ofc spin the event loop.
Ok, spinning the event loop works for me, and should make this bug relatively easy to fix as long as we have a way to get a callback when a particular origin has been preloaded into the localStorage.
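A minimal sketch of what was just agreed (every name here is hypothetical): the child's sync GetValue blocks only the child, while the parent-side handler spins its event loop until the bug 600307 background preload has delivered the data, so the parent main thread never blocks on disk.

#include "nsThreadUtils.h"

bool
StorageParent::RecvGetValue(const nsString& aKey, nsString* aValue)
{
  mCache->EnsurePreload();  // kick off the async preload if not started yet
  while (!mCache->IsLoaded()) {
    // Keep servicing main-thread events; the storage I/O thread does the
    // actual read and eventually flips IsLoaded().
    if (!NS_ProcessNextEvent()) {
      return false;
    }
  }
  // Purely in-memory from here on.
  return NS_SUCCEEDED(mCache->GetValue(aKey, *aValue));
}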
Thanks for all the suggestions, guys.  I see two ways here:

- spin the event loop on the parent (super-easy to do), however it may cause unpredictable behavior when something underlying changes
- have a dedicated channel handled on a non-main thread (no idea how to do it), I am more a fan of this one if it's easy to do
(In reply to Honza Bambas (:mayhemer) from comment #79)
> - spin the event loop on the parent (super-easy to do), however it may cause
> unpredictable behavior when something underlying changes
What underlying? You'd be spinning the event loop very close to the place where data is received from the
other process, so usually very close to the normal main event loop spin.
And in the case of showModalDialog, nested spinning is legitimate anyway.
As far as I can see, spinning the event loop in this case isn't really a problem. It doesn't cause similar
issues to alert() or sync XHR.
(In reply to Honza Bambas (:mayhemer) from comment #79)
> Thanks for all the suggestions, guys.  I see two ways here:
> 
> - have a dedicated channel handled on a non-main thread (no idea how to do
> it), I am more a fan of this one if it's easy to do

This is actually pretty straightforward and is definitely the better approach, but I don't know what assumptions in the localstorage code might be violated by running on a non-main thread.  I or bent can help with the IPC boilerplate.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #81)
> This is actually pretty straightforward and is definitely the better
> approach, but I don't know what assumptions in the localstorage code might
> be violated by running on a non-main thread.  I or bent can help with the
> IPC boilerplate.

Can I let the *child* receive messages on a thread other than the main thread somehow?  It would make the new code much, much simpler and also immediately fix the issue of blocking the parent's main thread.

Thanks for help.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #83)
> Yes, absolutely.

Can you please reveal how?  (Code example, API I can use to set the thread...)
Here's the unit test for the feature

http://mxr.mozilla.org/mozilla-central/source/ipc/ipdl/test/cxx/TestOpens.cpp

The test starts in the parent process main thread, then sends a message to the child that initiates
 - in the child, creating a new channel, PTestOpensOpened
 - in the child, creating a new thread to process messages for that channel
 - binding the channel to the off-main thread
 - in the parent, accepting the channel allocation message
 - in the parent, also creating a non-main thread to process messages for the new channel and binding it to that thread

Then there's some IPC chatter on the newly-created channel.
Honza: I don't understand why we would want to do anything off the main thread in the child process though? Can you explain?
(In reply to Jonas Sicking (:sicking) from comment #86)
> Honza: I don't understand why we would want to do anything off the main
> thread in the child process though? Can you explain?

Yes.

I will describe the new implementation from bug 600307 now:

In single-process mode, when a window is opening we tell the storage background I/O thread to do an async preload of the storage cache.  It SELECTs and then, for each key+value, calls the LoadItem(key, value) method of the cache.  When done it calls LoadDone(), which marks the cache as loaded (also understood as "valid").  These calls are made on the background I/O thread.

When a JS script accesses localStorage before the preload has finished, we block the main thread until the LoadDone() call (we wait via monitor.wait(); LoadDone() calls monitor.notify()).  There is no main-thread event looping.

In the IPC case, when we get into the situation where we have to wait for the async preload to finish, there is no way (I didn't find one...) to just loop over the IPC message queue to receive all the async messages we expect to come from the parent.  I cannot do monitor.wait() on the main thread since the main thread has to spin the event loop to process the messages from the parent.  And I don't want to spin the loop (at whatever cost; we know why).

I solved this by doing a sync call to the parent to do a preload of all remaining unloaded data synchronously.


Thus, if the child process could process messages on a non-main thread, I could just call monitor.wait() on the main thread to wait for all data to be received from the parent.  I could remove the relatively complicated sync-load IPC code as well.
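For reference, the wait/notify handshake described above as a sketch (assuming a mozilla::Monitor member |mMonitor| and a bool |mLoaded|; the class and method names are illustrative):

#include "mozilla/Monitor.h"

void
StorageCache::WaitForPreload()
{
  mozilla::MonitorAutoLock lock(mMonitor);
  while (!mLoaded) {
    lock.Wait();  // main thread sleeps; no event loop spinning
  }
}

// Called by the background I/O thread -- or, with an off-main-thread IPC
// channel in the child, by the thread receiving the parent's messages.
void
StorageCache::LoadDone()
{
  mozilla::MonitorAutoLock lock(mMonitor);
  mLoaded = true;
  lock.Notify();  // wake the waiting main thread
}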
(In reply to Chris Jones [:cjones] [:warhammer] from comment #85)
>  - binding the channel to the off-main thread

Thanks for this example.

My situation is backwards with regard to the IPC connection creation: it starts on the child.  So I need to initialize the child first on a non-main thread.

The trick in the test is in these lines I presume:

child opens PTestOpensOpened;

virtual PTestOpensOpenedChild*
    AllocPTestOpensOpened(Transport* transport, ProcessId otherProcess) MOZ_OVERRIDE;

When is AllocPTestOpensOpened called?
That example does initiate the new IPC connection from the child side.

The channel is allocated from this call

http://mxr.mozilla.org/mozilla-central/source/ipc/ipdl/test/cxx/TestOpens.cpp#157

which is the interface generated from |child opens PTestOpensOpened|.

That function does some IPC magic behind the scenes.  When it's finished, your AllocPFoo*() functions will be called in both the parent and child processes and handed tokens that refer to the new IPC channel.  Those methods, implemented by you, are responsible for Open()ing the channel (which binds it to the thread it was opened on).

For that part you'll just want to copy the boilerplate ;).
Chris, thanks very much for the suggestions!

I decided to do this as a followup after bug 600307 landing.
Depends on: 600307
Note: after we land bug 600307 the situation will improve a lot.  However, there is still a chance of being blocked on a corner-case synchronous load of the cache (comment 65).  That will be completely removed by the work in this bug.
(This is one of the sources of b2g jank that we prefer not to talk about.)
Blocks: b2g-v-next
Just need to fix.
Whiteboard: [at-risk][see comment #43 for plan] → [at-risk][see comment #43 for plan][tech-bug]
Honza, what's the current situation here? Did anything change from comment 91 after the new implementation landed? What can we still do here? Comment 43 is clearly no longer the plan; having a plan, even a future one, would be useful.
Looks like bug 736144, the only dependency, is blocked on getting more telemetry.
Flags: needinfo?(honzab.moz)
Whiteboard: [at-risk][see comment #43 for plan][tech-bug] → [at-risk][tech-bug]
I don't work on this bug just because I'm really pretty occupied with network stuff (the http cache rewrite).

Blocking of the main thread on the parent may still happen, however with much lower probability than before bug 600307.  

What we need is to receive IPC messages in the parent process on a non-main thread - that's all - nothing more to do!  I have an idea how to do it, but the IPC API is hard.  

If there is anyone who could do it quickly then please do it.  I may not get to this for at least a month, I can say.
Flags: needinfo?(honzab.moz)
Thanks, I was going through the StorageJank bugs and trying to update their status and ensure there's a plan for them; this is good information to have here.
Honza, is comment 95 still relevant, should we just close this as WFM and file a separate bug for the IPC problem, since it's not related to main-thread IO?
Flags: needinfo?(honzab.moz)
(In reply to Marco Bonardo [::mak] from comment #98)
> Honza, is comment 95 still relevant, should we just close this as WFM and
> file a separate bug for the IPC problem, since it's not related to
> main-thread IO?

I'm sure Honza is ok with me answering this question...
Yeah, that comment is still relevant and I'm actually working on fixing the problem in bug 1350637.
Flags: needinfo?(honzab.moz)
Depends on: 1350637
(In reply to Jan Varga [:janv] from comment #99)
> (In reply to Marco Bonardo [::mak] from comment #98)
> > Honza, is comment 95 still relevant, should we just close this as WFM and
> > file a separate bug for the IPC problem, since it's not related to
> > main-thread IO?
> 
> I'm sure Honza is ok with me answering this question...
> Yeah, that comment is still relevant and I'm actually working on fixing the
> problem in bug 1350637.

Thanks, yes, I agree.
Please add the "perf" keyword.
Keywords: perf
See Also: → 1428169
Priority: -- → P2
Component: DOM → DOM: Web Storage

Marking works-for-me as fixed by bug 1350637, which :janv noted in comment 99 was the fix for avoiding the parent process main thread. (Things now go via PBackground in the legacy implementation, and will also go via PBackground when LocalStorage NextGen is fully landed and enabled.) The main thread of whatever content process is using localStorage can still block when the preload did not complete in time, but there's no avoiding that.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME