4e9a8f351b
try to fix a core from restarting a collection that was in the middle of dumping to disk.
Matt Wells
2015-04-22 16:07:16 -07:00
0656cc4c72
fix a core on seraph host #6
Matt Wells
2015-04-22 15:46:35 -07:00
b0b26126a5
fix parens bug for gbsortbyint:gbspiderdate) do not include ( or ) as part of the field value since they are associated with boolean syntax.
Matt Wells
2015-04-22 14:02:28 -06:00
7462b0cd84
gb -h fix
Matt Wells
2015-04-22 12:51:32 -06:00
00661287da
mysyn fixes
Matt
2015-04-22 08:34:29 -06:00
05fc660ef2
fix love<->like syn mapping from wiktionary.
Matt
2015-04-21 20:58:33 -06:00
a2feab9a4a
tap in some fixes for running the newly updated smokes for dealing with the new urls.csv format
Matt Wells
2015-04-21 15:20:57 -07:00
1dd3912ca0
default isr back on
mwells
2015-04-21 08:19:32 -06:00
8e5f57d677
take comments out
mwells
2015-04-21 08:14:54 -06:00
a7640dadc1
hop count bug fix when merging spiderdb lists and doing deduping. do not change hopcounts in spider request records.
mwells
2015-04-20 15:17:36 -06:00
e05dde5934
show the path depth of spidered urls in the logs
mwells
2015-04-19 16:17:30 -06:00
644ad28912
debugging the hopcount bug
Matt Wells
2015-04-19 15:51:29 -06:00
80f2584b5d
more new urls.csv fixes
Matt Wells
2015-04-15 18:38:29 -07:00
25aab18870
add crawl try # to urls.csv
Matt
2015-04-15 19:31:44 -06:00
11ea50935d
use new urls.csv only for GET /v3/crawl/download/token-collname_urls.csv version 3
Matt
2015-04-15 17:48:55 -06:00
ef42a9cf28
new urls.csv polish. moved columns around. added some new gbss fields, like spidered time.
Matt
2015-04-15 17:42:56 -06:00
fec347a7df
fix bug of partial facet counting.
Matt
2015-04-15 14:54:49 -06:00
496124da39
fix new urls.csv output
Matt
2015-04-15 12:53:43 -06:00
3191980f49
the new urls.csv format is ready. added url discovered time to gbssdocs so we know when we first found a url. also added to new urls.csv. fixed spiderdb list deduping so as not to discard the oldest spider request any more so we keep our discovered time intact.
Matt
2015-04-15 12:13:27 -06:00
f0f8f0a967
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-04-14 16:27:28 -06:00
0c88ebba9b
removed buggy close least used linked list logic. was causing data corruption in reads and writes. go to urgent shutdown mode if on 10th try so gb will actually exit. do not startup if there is critical data corruption.
Matt Wells
2015-04-14 15:26:46 -07:00
61af961dfd
use m_sentToDiffbotThisTime in SpiderReply now too
Matt Wells
2015-04-14 15:23:12 -07:00
99454bc8ca
added gbssSentToDiffbotThisTime and gbssSentToDiffbotAtSomeTime to gbss docs to clarify if the url was sent to diffbot at this crawl time, or any time. makes it easier to see what is getting processed this crawl round.
Matt Wells
2015-04-14 14:50:39 -07:00
a92b158bc7
gbss doc oopsy fixes
Matt
2015-04-13 16:47:41 -06:00
040a604ec6
xmldoc back to o2
Matt
2015-04-13 14:49:44 -06:00
497131d359
fix gbssdocid bug better
Matt
2015-04-13 14:33:57 -06:00
3e5218c54c
fix gbssDocId:123456789, et al, query. will only work for docs indexed after applying this fix.
Matt
2015-04-13 14:13:16 -06:00
31ac1fa2b0
quick fix for spider status
Matt Wells
2015-04-13 12:16:55 -07:00
8d1e67be0a
the diffbot objects we index as their own separate doc should inherit the hopcount from the parent html doc.
Matt Wells
2015-04-13 12:12:43 -07:00
4a32c8308e
Merge branch 'diffbot-testing' into diffbot
Matt Wells
2015-04-13 12:08:06 -07:00
614e9215cd
update getSpiderStatusMsg() to always set *status. always show diffbotreply when doing crawlbottesting
Matt Wells
2015-04-13 12:06:22 -07:00
9f836dbf75
fix corruption of s_vbuf (gb version) in the hosts table.
Matt
2015-04-13 11:13:44 -06:00
4a43e1387e
better fixes for core from sig alarms
Matt
2015-04-13 10:28:43 -06:00
48ac8bf80f
fix udp linked list thing again
Matt
2015-04-13 10:13:59 -06:00
e6696a6937
fix some more
Matt
2015-04-13 10:08:01 -06:00
9feb070fe9
fix issue of not being able to exit gb when a disk read retry is taking forever.
Matt
2015-04-13 10:06:08 -06:00
f5a7423336
fix bug of never calling callback
Matt
2015-04-13 09:56:21 -06:00
47ed2a57ee
update log msg
Matt Wells
2015-04-13 07:49:57 -07:00
43ced700d0
calls NEWS BLOG
Matt
2015-04-12 12:33:09 -06:00
2814e3db37
show screenshots
Matt
2015-04-12 11:52:22 -06:00
994ba73007
log more when doing crawlbottesting-* tests
Matt Wells
2015-04-12 06:46:33 -07:00
56a46fd294
fix printing of facets when &header=0 so diffbot json output is still simple and correct.
mwells
2015-04-10 16:25:38 -06:00
a891fb7bdc
turn on indexing spider status docs for all diffbot CRAWLS on startup, whether it was off or on before.
Matt Wells
2015-04-10 14:35:41 -07:00
4dce44c976
fix log msg
Matt Wells
2015-04-10 13:31:46 -07:00
02aa138fbb
debug helper msg
Matt Wells
2015-04-10 13:27:06 -07:00
13d0361756
try to speed up host #4 on seraph
Matt
2015-04-10 09:20:18 -06:00
6b139e9eee
clarify jam ups
Matt Wells
2015-04-08 18:30:27 -07:00
e38cb8c080
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-04-08 18:08:10 -07:00
a7e48515fa
fix core when adding gbss for a force deleted doc.
Matt Wells
2015-04-08 18:07:32 -07:00
64bae224e0
fix core on the GI
Matt Wells
2015-04-08 16:05:32 -06:00
97d3b185c1
just use INCOMING udp slots/sockets for jam detection. this will highlight the slow nodes better.
Matt Wells
2015-04-08 15:52:43 -06:00
7fd4310106
don't include gbss headers if not gbss documents
Matt Wells
2015-04-07 22:05:11 -07:00
2997b3bb28
fix for skipping dead shards on tag rec lookup
mwells
2015-04-07 14:36:47 -06:00
53a2d39afd
fix for calling callback of timedout udp slots
mwells
2015-04-07 14:18:42 -06:00
2114c40cda
fix not calling callback when udp reply times out. like for msg39 replies we need to timeout quickly.
Matt Wells
2015-04-07 12:38:35 -07:00
05a66cc367
fix bug of not being able to get ip address because peeksize is too big.
Matt Wells
2015-04-07 12:29:19 -07:00
b08d12a11e
fix cores associated with new spider status docs.
Matt Wells
2015-04-07 10:33:54 -07:00
4ed8231222
up max vfds again
Matt Wells
2015-04-07 09:56:04 -06:00
036fb4e0dc
crap, another oopsy fix
Matt
2015-04-06 14:43:58 -07:00
bc3335c434
more facet counting fixes
Matt
2015-04-06 14:38:38 -07:00
74fe3c5866
more facet counting fixes
Matt
2015-04-06 14:02:59 -07:00
1a262c8254
fixed oopsy
Matt
2015-04-06 13:51:19 -07:00
8326460e8f
fix counting of # docs that have facet field.
Matt Wells
2015-04-06 14:41:44 -06:00
330d9a9dbf
report max rounds reached, not max to process or crawl reached.
Matt Wells
2015-04-06 10:24:33 -06:00
bffaa09599
fix for the GI
Matt
2015-04-06 08:24:00 -06:00
de187dbb2b
documentation fix
Matt
2015-04-03 16:00:04 -06:00
8433c49aa9
make sure we index a spider status doc for each diffbot object. that way we can tell if diffbot objects are deduping, how they are changing over time, etc.
Matt
2015-04-03 14:59:09 -06:00
dad1cb15f4
fix excessive looping when calling makeCallbacks() on niceness 1 or above when none are available.
Matt
2015-04-03 12:12:58 -06:00
c991a2dcdd
try to ameliorate the udp slot jamming issue.
Matt
2015-04-03 10:43:11 -06:00
c2567ad244
a hopeful fix for host #0 always crashing from streaming socket timeouts.
Matt
2015-04-02 15:17:49 -06:00
2ce107e4be
keep track of how many times the host exited/cored as an exponent to the 'x' in the hosts table. this way we can detect hosts that have restarted many times and fix them.
Matt
2015-04-01 16:28:58 -06:00
e583850e40
fix core when searching bogus collection.
Matt
2015-04-01 15:30:59 -06:00
94a8210586
added CSV to output dropdown. show all json fields for spider status doc csv files. support spider status docs in csv output.
Matt
2015-04-01 13:53:03 -06:00
f26c9d609b
one more qa test fix for spider status docs
Matt
2015-04-01 12:47:32 -06:00
5e46262cb2
more fixes for qa'ing of new spider status docs
Matt
2015-04-01 12:03:17 -06:00
10a31783bb
fixes to pass internal qa tests in light of gbss (spider status doc) changes and other things. had to make xmldoc.o -O2 instead of -O3 to fix strange bug.
Matt
2015-04-01 11:20:36 -06:00
6b293f17e6
now show "totalDocsWithField" for each facet, so we know how many docs had that field, with any particular value, so we can do tf/idf type things.
Matt
2015-04-01 09:16:42 -06:00
47f6d9f414
clean out rebuild trees/buckets too
mwells
2015-03-21 22:42:49 -06:00
e99b2f0a65
added RdbBuckets::cleanBuckets() corresponding to RdbTree::cleanTree() to remove keys from deleted collections at startup.
Matt
2015-03-21 22:28:34 -06:00
000c5d67e9
do not index xml docs' body for custom crawls or when indexbody is turned off.
Matt
2015-03-21 09:21:44 -06:00
9f42a6d5ff
fix indexing of spider status docs
Matt Wells
2015-03-20 18:08:39 -07:00
7d82a5ca69
try to get diffbot reply info first before making spider status doc
Matt Wells
2015-03-20 17:44:21 -07:00
07d13541ed
emergency fixes for corrupt tagdb tag id
Matt Wells
2015-03-20 17:21:52 -07:00
62bbede498
try to fix strange unknown tagid core
Matt Wells
2015-03-20 14:20:40 -07:00
7343c40e98
fix a couple status doc fields.
Matt Wells
2015-03-20 12:51:36 -06:00
1a80cb1b5d
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
mwells
2015-03-20 12:45:24 -06:00
7de9f6940b
documentation for new gbss spider status doc fields.
mwells
2015-03-20 12:43:21 -06:00
70bdc97bfd
update README.md
Matt
2015-03-19 23:31:09 -06:00
afa14e84a1
fixes for generating spider status docs. and displaying them.
mwells
2015-03-19 23:24:36 -06:00
fc0b3e7743
url filters fixes.
Matt
2015-03-19 18:17:26 -06:00
ed5fe6d284
more bug fixes from 'delete' column addition to url filters.
Matt
2015-03-19 18:05:36 -06:00
4b1dcfa068
fix isrssext bug
Matt
2015-03-19 17:28:19 -06:00
90456222b6
now we add the spider status docs as json documents. so you can facet/sortby the various fields, etc.
Matt
2015-03-19 16:17:36 -06:00
d5560c3e77
final fix for new delete column
mwells
2015-03-17 21:47:31 -06:00
c9b14b1b89
fix 'delete' checkbox in url filters. fix reading in of xml conf files that have </> tags.
Matt
2015-03-17 21:20:27 -06:00
dfc069aaa1
do away with filtered/banned spider priorities. add checkbox to signify force deletes to remove urls from index if in the index, or not allow them in.
mwells
2015-03-17 20:27:23 -06:00
dea534827e
langidbits init bug leftover from searchinput reset memset fix i think.
Matt
2015-03-17 15:04:31 -06:00