Commit Graph

  • e254118b5f print numbers as strings when printing the csv Matt Wells 2015-04-24 10:26:53 -0600
  • 0a48930ba3 spaces in links fix. added gbssDiffbotUri to gbss docs. Matt Wells 2015-04-24 10:23:07 -0600
  • e6a914d882 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-23 21:18:19 -0600
  • a81dcb6442 fix slow spider proxy loop Matt 2015-04-23 21:17:55 -0600
  • 2aeb88e19b update search api doc Matt Wells 2015-04-22 18:55:25 -0600
  • c3c7e757fa Merge remote-tracking branch 'origin/diffbot' into diffbot-kevin Matt 2015-04-22 17:45:19 -0600
  • 4e9a8f351b try to fix a core from restarting a collection that was in the middle of dumping to disk. Matt Wells 2015-04-22 16:07:16 -0700
  • 0656cc4c72 fix a core on seraph host #6 Matt Wells 2015-04-22 15:46:35 -0700
  • b0b26126a5 fix parens bug for gbsortbyint:gbspiderdate) do not include ( or ) as part of the field value since they are associated with boolean syntax. Matt Wells 2015-04-22 14:02:28 -0600
  • 7462b0cd84 gb -h fix Matt Wells 2015-04-22 12:51:32 -0600
  • 00661287da mysyn fixes Matt 2015-04-22 08:34:29 -0600
  • 05fc660ef2 fix love<->like syn mapping from wiktionary. Matt 2015-04-21 20:58:33 -0600
  • a2feab9a4a tap in some fixes for running the newly updated smokes for dealing with the new urls.csv format Matt Wells 2015-04-21 15:20:57 -0700
  • 1dd3912ca0 default isr back on mwells 2015-04-21 08:19:32 -0600
  • 8e5f57d677 take comments out mwells 2015-04-21 08:14:54 -0600
  • a7640dadc1 hop count bug fix when merging spiderdb lists and doing deduping. do not change hopcounts in spider request records. mwells 2015-04-20 15:17:36 -0600
  • e05dde5934 show the path depth of spidered urls in the logs mwells 2015-04-19 16:17:30 -0600
  • 644ad28912 debugging the hopcount bug Matt Wells 2015-04-19 15:51:29 -0600
  • 80f2584b5d more new urls.csv fixes Matt Wells 2015-04-15 18:38:29 -0700
  • 25aab18870 add crawl try # to urls.csv Matt 2015-04-15 19:31:44 -0600
  • 11ea50935d use new urls.csv only for GET /v3/crawl/download/token-collname_urls.csv version 3 Matt 2015-04-15 17:48:55 -0600
  • ef42a9cf28 new urls.csv polish. moved columns around. added some new gbss fields, like spidered time. Matt 2015-04-15 17:42:56 -0600
  • fec347a7df fix bug of partial facet counting. Matt 2015-04-15 14:54:49 -0600
  • 496124da39 fix new urls.csv output Matt 2015-04-15 12:53:43 -0600
  • 3191980f49 the new urls.csv format is ready. added url discovered time to gbssdocs so we know when we first found a url. also added to new urls.csv. fixed spiderdb list deduping so as not to discard the oldest spider request any more so we keep our discovered time intact. Matt 2015-04-15 12:13:27 -0600
  • f0f8f0a967 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-04-14 16:27:28 -0600
  • 0c88ebba9b removed buggy close least used linked list logic. was causing data corruption in reads and writes. go to urgent shutdown mode if on 10th try so gb will actually exit. do not startup if there is critical data corruption. Matt Wells 2015-04-14 15:26:46 -0700
  • 61af961dfd use m_sentToDiffbotThisTime in SpiderReply now too Matt Wells 2015-04-14 15:23:12 -0700
  • 99454bc8ca added gbssSentToDiffbotThisTime and gbssSentToDiffbotAtSomeTime to gbss docs to clarify if the url was sent to diffbot at this crawl time, or any time. makes it easier to see what is getting processed this crawl round. Matt Wells 2015-04-14 14:50:39 -0700
  • a92b158bc7 gbss doc oopsy fixes Matt 2015-04-13 16:47:41 -0600
  • 040a604ec6 xmldoc back to o2 Matt 2015-04-13 14:49:44 -0600
  • 497131d359 fix gbssdocid bug better Matt 2015-04-13 14:33:57 -0600
  • 3e5218c54c fix gbssDocId:123456789, et al, query. will only work for docs indexed after applying this fix. Matt 2015-04-13 14:13:16 -0600
  • 31ac1fa2b0 quick fix for spider status Matt Wells 2015-04-13 12:16:55 -0700
  • 8d1e67be0a the diffbot objects we index as their own separate doc should inherit the hopcount from the parent html doc. Matt Wells 2015-04-13 12:12:43 -0700
  • 4a32c8308e Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-04-13 12:08:06 -0700
  • 614e9215cd update getSpiderStatusMsg() to always set *status. always show diffbotreply when doing crawlbottesting Matt Wells 2015-04-13 12:06:22 -0700
  • 9f836dbf75 fix corruption of s_vbuf (gb version) in the hosts table. Matt 2015-04-13 11:13:44 -0600
  • 4a43e1387e better fixes for core from sig alarms Matt 2015-04-13 10:28:43 -0600
  • 48ac8bf80f fix udp linked list thing again Matt 2015-04-13 10:13:59 -0600
  • e6696a6937 fix some more Matt 2015-04-13 10:08:01 -0600
  • 9feb070fe9 fix issue of not being able to exit gb when a disk read retry is taking forever. Matt 2015-04-13 10:06:08 -0600
  • f5a7423336 fix bug of never calling callback Matt 2015-04-13 09:56:21 -0600
  • 47ed2a57ee update log msg Matt Wells 2015-04-13 07:49:57 -0700
  • 43ced700d0 calls NEWS BLOG Matt 2015-04-12 12:33:09 -0600
  • 2814e3db37 show screenshots Matt 2015-04-12 11:52:22 -0600
  • 994ba73007 log more when doing crawlbottesting-* tests Matt Wells 2015-04-12 06:46:33 -0700
  • 56a46fd294 fix printing of facets when &header=0 so diffbot json output is still simple and correct. mwells 2015-04-10 16:25:38 -0600
  • a891fb7bdc turn on indexing spider status docs for all diffbot CRAWLS on startup, whether it was off or on before. Matt Wells 2015-04-10 14:35:41 -0700
  • 4dce44c976 fix log msg Matt Wells 2015-04-10 13:31:46 -0700
  • 02aa138fbb debug helper msg Matt Wells 2015-04-10 13:27:06 -0700
  • 13d0361756 try to speed up host #4 on seraph Matt 2015-04-10 09:20:18 -0600
  • 6b139e9eee clarify jam ups Matt Wells 2015-04-08 18:30:27 -0700
  • e38cb8c080 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-04-08 18:08:10 -0700
  • a7e48515fa fix core when adding gbss for a force deleted doc. Matt Wells 2015-04-08 18:07:32 -0700
  • 64bae224e0 fix core on the GI Matt Wells 2015-04-08 16:05:32 -0600
  • 97d3b185c1 just use INCOMING udp slots/sockets for jam detection. this will highlight the slow nodes better. Matt Wells 2015-04-08 15:52:43 -0600
  • 7fd4310106 don't include gbss headers if not gbss documents Matt Wells 2015-04-07 22:05:11 -0700
  • fea5210906 fix infinite loop bug mwells 2015-04-07 15:27:26 -0600
  • 2997b3bb28 fix for skipping dead shards on tag re clookup mwells 2015-04-07 14:36:47 -0600
  • 53a2d39afd fix for calling callback of timedout udp slots mwells 2015-04-07 14:18:42 -0600
  • 2114c40cda fix not calling callback when udp reply times out. like for msg39 replies we need to timeout quickly. Matt Wells 2015-04-07 12:38:35 -0700
  • 05a66cc367 fix bug of not able to get ip address because peeksize is too big. Matt Wells 2015-04-07 12:29:19 -0700
  • b08d12a11e fix cores associated with new spider status docs. Matt Wells 2015-04-07 10:33:54 -0700
  • 4ed8231222 upp max vfds again Matt Wells 2015-04-07 09:56:04 -0600
  • 036fb4e0dc crap, another oopsy fix Matt 2015-04-06 14:43:58 -0700
  • bc3335c434 more facet counting fixes Matt 2015-04-06 14:38:38 -0700
  • 74fe3c5866 more facet counting fixes Matt 2015-04-06 14:02:59 -0700
  • 1a262c8254 fixed oopsy Matt 2015-04-06 13:51:19 -0700
  • 8326460e8f fix counting of # docs that have facet field. Matt Wells 2015-04-06 14:41:44 -0600
  • 330d9a9dbf report max rounds reached, not max to process or crawl reached. Matt Wells 2015-04-06 10:24:33 -0600
  • bffaa09599 fix for the GI Matt 2015-04-06 08:24:00 -0600
  • de187dbb2b documentation fix Matt 2015-04-03 16:00:04 -0600
  • 8433c49aa9 make sure we index a spider status doc for each diffbot object. that way we can tell if diffbot objects are deduping, how they are changing over time, etc. Matt 2015-04-03 14:59:09 -0600
  • dad1cb15f4 fix excessive looping when calling makeCallbacks() on niceness 1 or above when none are available. Matt 2015-04-03 12:12:58 -0600
  • c991a2dcdd try to ameliorate the udp slot jamming issue. Matt 2015-04-03 10:43:11 -0600
  • c2567ad244 a hopeful fix for host #0 always crashing from streaming socket timeouts. Matt 2015-04-02 15:17:49 -0600
  • 2ce107e4be keep track of how many times the host exited/cored as an exponent to the 'x' in the hosts table. this way we can detect hosts that have restarted many times and fix them. Matt 2015-04-01 16:28:58 -0600
  • e583850e40 fix core when searching bogus collection. Matt 2015-04-01 15:30:59 -0600
  • 94a8210586 added CSV to output dropdown. show all json fields for spider status doc csv files. support spider status docs in csv output. Matt 2015-04-01 13:53:03 -0600
  • f26c9d609b one more qa test fix for spider status docs Matt 2015-04-01 12:47:32 -0600
  • 5e46262cb2 more fixes for qa'ing of new spider status docs Matt 2015-04-01 12:03:17 -0600
  • 10a31783bb fixes to pass internal qa tests in light of gbss (spider status doc) changes and other things. had to make xmldoc.o -O2 instead of -O3 to fix strange bug. Matt 2015-04-01 11:20:36 -0600
  • 6b293f17e6 now show "totalDocsWithField" for each facet, so we know how many docs had that field, with any particular value, so we can do tf/idf type things. Matt 2015-04-01 09:16:42 -0600
  • 47f6d9f414 clean out rebuild trees/buckets too mwells 2015-03-21 22:42:49 -0600
  • e99b2f0a65 added RdbBuckets::cleanBuckets() corresponding to RdbTree::cleanTree() to remove keys from deleted collections at startup. Matt 2015-03-21 22:28:34 -0600
  • 000c5d67e9 do not index xml docs' body for custom crawls or when indexbody is turned off. Matt 2015-03-21 09:21:44 -0600
  • 9f42a6d5ff fix indexing of spider status docs Matt Wells 2015-03-20 18:08:39 -0700
  • 7d82a5ca69 try to get diffbot reply info first before making spider status doc Matt Wells 2015-03-20 17:44:21 -0700
  • 07d13541ed emergency fixes for corrupt tagdb tag id Matt Wells 2015-03-20 17:21:52 -0700
  • 62bbede498 try to fix strange unknown tagid core Matt Wells 2015-03-20 14:20:40 -0700
  • 7343c40e98 fix a couple status doc fields. Matt Wells 2015-03-20 12:51:36 -0600
  • 1a80cb1b5d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing mwells 2015-03-20 12:45:24 -0600
  • 7de9f6940b documentation for new gbss spider status doc fields. mwells 2015-03-20 12:43:21 -0600
  • 70bdc97bfd update README.md Matt 2015-03-19 23:31:09 -0600
  • afa14e84a1 fixes for generating spider status docs. and displaying them. mwells 2015-03-19 23:24:36 -0600
  • fc0b3e7743 url filters fixes. Matt 2015-03-19 18:17:26 -0600
  • ed5fe6d284 more bug fixes from 'delete' column addition to url filters. Matt 2015-03-19 18:05:36 -0600
  • 4b1dcfa068 fix isrssext bug Matt 2015-03-19 17:28:19 -0600
  • 90456222b6 now we add the spider status docs as json documents. so you can facet/sortby the various fields, etc. Matt 2015-03-19 16:17:36 -0600