e254118b5fprint numbers as strings when printing the csv
Matt Wells
2015-04-24 10:26:53 -0600
0a48930ba3spaces in links fix. added gbssDiffbotUri to gbss docs.
Matt Wells
2015-04-24 10:23:07 -0600
e6a914d882Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-04-23 21:18:19 -0600
a81dcb6442fix slow spider proxy loop
2015-04-23 21:17:55 -0600
2aeb88e19bupdate search api doc
Matt Wells
2015-04-22 18:55:25 -0600
c3c7e757faMerge remote-tracking branch 'origin/diffbot' into diffbot-kevin
2015-04-22 17:45:19 -0600
4e9a8f351btry to fix a core from restarting a collection that was in the middle of dumping to disk.
Matt Wells
2015-04-22 16:07:16 -0700
0656cc4c72fix a core on seraph host #6
Matt Wells
2015-04-22 15:46:35 -0700
b0b26126a5fix parens bug for gbsortbyint:gbspiderdate) do not include ( or ) as part of the field value since they are associated with boolean syntax.
Matt Wells
2015-04-22 14:02:28 -0600
7462b0cd84gb -h fix
Matt Wells
2015-04-22 12:51:32 -0600
00661287damysyn fixes
2015-04-22 08:34:29 -0600
05fc660ef2fix love<->like syn mapping from wiktionary.
2015-04-21 20:58:33 -0600
a2feab9a4atap in some fixes for running the newly updated smokes for dealing with the new urls.csv format
Matt Wells
2015-04-21 15:20:57 -0700
1dd3912ca0default isr back on
2015-04-21 08:19:32 -0600
8e5f57d677take comments out
2015-04-21 08:14:54 -0600
a7640dadc1hop count bug fix when merging spiderdb lists and doing deduping. do not change hopcounts in spider request records.
2015-04-20 15:17:36 -0600
e05dde5934show the path depth of spidered urls in the logs
2015-04-19 16:17:30 -0600
644ad28912debugging the hopcount bug
Matt Wells
2015-04-19 15:51:29 -0600
80f2584b5dmore new urls.csv fixes
Matt Wells
2015-04-15 18:38:29 -0700
25aab18870add crawl try # to urls.csv
2015-04-15 19:31:44 -0600
11ea50935duse new urls.csv only for GET /v3/crawl/download/token-collname_urls.csv version 3
2015-04-15 17:48:55 -0600
ef42a9cf28new urls.csv polish. moved columns around. added some new gbss fields, like spidered time.
2015-04-15 17:42:56 -0600
fec347a7dffix bug of partial facet counting.
2015-04-15 14:54:49 -0600
496124da39fix new urls.csv output
2015-04-15 12:53:43 -0600
3191980f49the new urls.csv format is ready. added url discovered time to gbssdocs so we know when we first found a url. also added to new urls.csv. fixed spiderdb list deduping so as not to discard the oldest spider request any more so we keep our discovered time intact.
2015-04-15 12:13:27 -0600
f0f8f0a967Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-04-14 16:27:28 -0600
0c88ebba9bremoved buggy close least used linked list logic. was causing data corruption in reads and writes. go to urgent shutdown mode if on 10th try so gb will actually exit. do not startup if there is critical data corruption.
Matt Wells
2015-04-14 15:26:46 -0700
61af961dfduse m_sentToDiffbotThisTime in SpiderReply now too
Matt Wells
2015-04-14 15:23:12 -0700
99454bc8caadded gbssSentToDiffbotThisTime and gbssSentToDiffbotAtSomeTime to gbss docs to clarify if the url was sent to diffbot at this crawl time, or any time. makes it easier to see what is getting processed this crawl round.
Matt Wells
2015-04-14 14:50:39 -0700
a92b158bc7gbss doc oopsy fixes
2015-04-13 16:47:41 -0600
040a604ec6xmldoc back to o2
2015-04-13 14:49:44 -0600
497131d359fix gbssdocid bug better
2015-04-13 14:33:57 -0600
3e5218c54cfix gbssDocId:123456789, et al, query. will only work for docs indexed after applying this fix.
2015-04-13 14:13:16 -0600
31ac1fa2b0quick fix for spider status
Matt Wells
2015-04-13 12:16:55 -0700
8d1e67be0athe diffbot objects we index as their own separate doc should inherit the hopcount from the parent html doc.
Matt Wells
2015-04-13 12:12:43 -0700
4a32c8308eMerge branch 'diffbot-testing' into diffbot
Matt Wells
2015-04-13 12:08:06 -0700
614e9215cdupdate getSpiderStatusMsg() to always set *status. always show diffbotreply when doing crawlbottesting
Matt Wells
2015-04-13 12:06:22 -0700
9f836dbf75fix corruption of s_vbuf (gb version) in the hosts table.
2015-04-13 11:13:44 -0600
4a43e1387ebetter fixes for core from sig alarms
2015-04-13 10:28:43 -0600
48ac8bf80ffix udp linked list thing again
2015-04-13 10:13:59 -0600
e6696a6937fix some more
2015-04-13 10:08:01 -0600
9feb070fe9fix issue of not being able to exit gb when a disk read retry is taking forever.
2015-04-13 10:06:08 -0600
f5a7423336fix bug of never calling callback
2015-04-13 09:56:21 -0600
47ed2a57eeupdate log msg
Matt Wells
2015-04-13 07:49:57 -0700
43ced700d0calls NEWS BLOG
2015-04-12 12:33:09 -0600
2814e3db37show screenshots
2015-04-12 11:52:22 -0600
994ba73007log more when doing crawlbottesting-* tests
Matt Wells
2015-04-12 06:46:33 -0700
56a46fd294fix printing of facets when &header=0 so diffbot json output is still simple and correct.
2015-04-10 16:25:38 -0600
a891fb7bdcturn on indexing spider status docs for all diffbot CRAWLS on startup, whether it was off or on before.
Matt Wells
2015-04-10 14:35:41 -0700
4dce44c976fix log msg
Matt Wells
2015-04-10 13:31:46 -0700
02aa138fbbdebug helper msg
Matt Wells
2015-04-10 13:27:06 -0700
13d0361756try to speed up host #4 on seraph
2015-04-10 09:20:18 -0600
6b139e9eeeclarify jam ups
Matt Wells
2015-04-08 18:30:27 -0700
e38cb8c080Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-04-08 18:08:10 -0700
a7e48515fafix core when adding gbss for a force deleted doc.
Matt Wells
2015-04-08 18:07:32 -0700
64bae224e0fix core on the GI
Matt Wells
2015-04-08 16:05:32 -0600
97d3b185c1just use INCOMING udp slots/sockets for jam detection. this will highlight the slow nodes better.
Matt Wells
2015-04-08 15:52:43 -0600
7fd4310106don't include gbss headers if not gbss documents
Matt Wells
2015-04-07 22:05:11 -0700
2997b3bb28fix for skipping dead shards on tagrec lookup
2015-04-07 14:36:47 -0600
53a2d39afdfix for calling callback of timedout udp slots
2015-04-07 14:18:42 -0600
2114c40cdafix not calling callback when udp reply times out. like for msg39 replies we need to timeout quickly.
Matt Wells
2015-04-07 12:38:35 -0700
05a66cc367fix bug of not able to get ip address because peeksize is too big.
Matt Wells
2015-04-07 12:29:19 -0700
b08d12a11efix cores associated with new spider status docs.
Matt Wells
2015-04-07 10:33:54 -0700
4ed8231222up max vfds again
Matt Wells
2015-04-07 09:56:04 -0600
036fb4e0dccrap, another oopsy fix
2015-04-06 14:43:58 -0700
bc3335c434more facet counting fixes
2015-04-06 14:38:38 -0700
74fe3c5866more facet counting fixes
2015-04-06 14:02:59 -0700
1a262c8254fixed oopsy
2015-04-06 13:51:19 -0700
8326460e8ffix counting of # docs that have facet field.
Matt Wells
2015-04-06 14:41:44 -0600
330d9a9dbfreport max rounds reached, not max to process or crawl reached.
Matt Wells
2015-04-06 10:24:33 -0600
bffaa09599fix for the GI
2015-04-06 08:24:00 -0600
de187dbb2bdocumentation fix
2015-04-03 16:00:04 -0600
8433c49aa9make sure we index a spider status doc for each diffbot object. that way we can tell if diffbot objects are deduping, how they are changing over time, etc.
2015-04-03 14:59:09 -0600
dad1cb15f4fix excessive looping when calling makeCallbacks() on niceness 1 or above when none are available.
2015-04-03 12:12:58 -0600
c991a2dcddtry to ameliorate the udp slot jamming issue.
2015-04-03 10:43:11 -0600
c2567ad244a hopeful fix for host #0 always crashing from streaming socket timeouts.
2015-04-02 15:17:49 -0600
2ce107e4bekeep track of how many times the host exited/cored as an exponent to the 'x' in the hosts table. this way we can detect hosts that have restarted many times and fix them.
2015-04-01 16:28:58 -0600
e583850e40fix core when searching bogus collection.
2015-04-01 15:30:59 -0600
94a8210586added CSV to output dropdown. show all json fields for spider status doc csv files. support spider status docs in csv output.
2015-04-01 13:53:03 -0600
f26c9d609bone more qa test fix for spider status docs
2015-04-01 12:47:32 -0600
5e46262cb2more fixes for qa'ing of new spider status docs
2015-04-01 12:03:17 -0600
10a31783bbfixes to pass internal qa tests in light of gbss (spider status doc) changes and other things. had to make xmldoc.o -O2 instead of -O3 to fix strange bug.
2015-04-01 11:20:36 -0600
6b293f17e6now show "totalDocsWithField" for each facet, so we know how many docs had that field, with any particular value, so we can do tf/idf type things.
2015-04-01 09:16:42 -0600
47f6d9f414clean out rebuild trees/buckets too
2015-03-21 22:42:49 -0600
e99b2f0a65added RdbBuckets::cleanBuckets() corresponding to RdbTree::cleanTree() to remove keys from deleted collections at startup.
2015-03-21 22:28:34 -0600
000c5d67e9do not index xml docs' body for custom crawls or when indexbody is turned off.
2015-03-21 09:21:44 -0600
9f42a6d5fffix indexing of spider status docs
Matt Wells
2015-03-20 18:08:39 -0700
7d82a5ca69try to get diffbot reply info first before making spider status doc
Matt Wells
2015-03-20 17:44:21 -0700
07d13541edemergency fixes for corrupt tagdb tag id
Matt Wells
2015-03-20 17:21:52 -0700
62bbede498try to fix strange unknown tagid core
Matt Wells
2015-03-20 14:20:40 -0700
7343c40e98fix a couple status doc fields.
Matt Wells
2015-03-20 12:51:36 -0600
1a80cb1b5dMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2015-03-20 12:45:24 -0600
7de9f6940bdocumentation for new gbss spider status doc fields.
2015-03-20 12:43:21 -0600
70bdc97bfdupdate README.md
2015-03-19 23:31:09 -0600
afa14e84a1fixes for generating spider status docs. and displaying them.
2015-03-19 23:24:36 -0600
fc0b3e7743url filters fixes.
2015-03-19 18:17:26 -0600
ed5fe6d284more bug fixes from 'delete' column addition to url filters.
2015-03-19 18:05:36 -0600
4b1dcfa068fix isrssext bug
2015-03-19 17:28:19 -0600
90456222b6now we add the spider status docs as json documents. so you can facet/sortby the various fields, etc.
2015-03-19 16:17:36 -0600